How far down a DataFrame must I search to find all elements of a list?

Posted 2024-09-26 22:51:19


I have a list:

elements = ['a', 'b', 'c', 'd']

and a DataFrame that contains some or all of the list's elements:

       mycol
0      a
1      x
2      y
3      e
4      b
5      c
6      o
7      l
8      s
9      d
10     g

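I want to know how far down the df I have to search to find every element of the list. In this example the answer would be 10, because the last of the list's elements to appear, d, does not show up until the 10th row (index 9).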

Thanks


3 Answers

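A plain loop is worth considering. I couldn't come up with a fancier indexing answer that holds up on larger test data, but Barmar's loop should be reliable: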

Just loop over the dataframe indexes. If the current df element is in the list, remove it from the list. When the list becomes empty, the current index is the answer.

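import numpy as np

def idxall(series, elements):
    remaining = list(elements)  # copy so the caller's list is not mutated
    for i, e in enumerate(series.to_numpy()):  # faster than series.items()
        if e in remaining:
            remaining.remove(e)
            if not remaining:
                return i + 1  # 1-based count of rows scanned
    return np.nan  # not every element was found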
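
The same loop can also be written with a set, which gives constant-time membership checks and leaves the caller's list untouched. This is only a minimal sketch, assuming the elements are hashable; the name rows_until_all_found is made up for illustration and does not come from any of the answers.

```python
import numpy as np
import pandas as pd


def rows_until_all_found(series, elements):
    """Return how many rows must be scanned to see every element, or nan."""
    remaining = set(elements)  # a copy, so the caller's list is not modified
    if not remaining:
        return 0  # assumption: nothing to find means no rows are needed
    for i, value in enumerate(series.to_numpy()):
        remaining.discard(value)  # no-op when value is not being searched for
        if not remaining:
            return i + 1  # convert the 0-based position to a row count
    return np.nan  # at least one element never appeared


df = pd.DataFrame({'mycol': list('axyebcolsdg')})
elements = ['a', 'b', 'c', 'd']
print(rows_until_all_found(df['mycol'], elements))  # 10
print(elements)  # ['a', 'b', 'c', 'd']
```

Because the function works on its own set, it can be called repeatedly with the same elements list.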

Timing

Given df = pd.DataFrame({'mycol': np.random.choice(list(string.ascii_lowercase), size=1000)})

%timeit tdy_idxall(df.mycol, list(string.ascii_lowercase))
# 21.4 µs ± 7.44 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit henry_ecker_np_unique(df.mycol, list(string.ascii_lowercase))
# 379 µs ± 48.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit u12_forward_idxmax(df.mycol, list(string.ascii_lowercase))
# 538 µs ± 61.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit corralien_idxall(df.mycol, list(string.ascii_lowercase))
# 1.28 ms ± 243 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
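
The %timeit lines above are IPython magics. Outside IPython, a comparable measurement can be sketched with the standard timeit module. The sketch below assumes tdy_idxall is the loop from this answer (copied here unchanged); the seed and the number of calls are arbitrary choices, and absolute numbers will differ from machine to machine.

```python
import string
import timeit

import numpy as np
import pandas as pd


def tdy_idxall(series, elements):
    # The loop from this answer; `elements` is consumed, so pass a fresh list.
    for i, e in enumerate(series.to_numpy()):
        if e in elements:
            elements.remove(e)
            if not elements:
                return i + 1
    return np.nan


rng = np.random.default_rng(0)  # fixed seed (arbitrary) for repeatability
letters = list(string.ascii_lowercase)
df = pd.DataFrame({'mycol': rng.choice(letters, size=1000)})

# A fresh list is built on every call, as in the %timeit lines.
n_calls = 1000
total = timeit.timeit(lambda: tdy_idxall(df.mycol, list(letters)), number=n_calls)
print(f"{total / n_calls * 1e6:.1f} µs per call")
```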

Verification

  • Using the OP's sample:

    df = pd.DataFrame({'mycol': list('axyebcolsdg')})
    elements = list('abcd')
    
    idxall(df.mycol, elements)
    # 10
    
  • Using Henry's sample #1 (mixed order and duplicates):

    df = pd.DataFrame({'mycol': list('dxcabcodsdg')})
    elements = list('abcd')
    
    idxall(df.mycol, elements)
    # 5
    
  • Using Henry's sample #2 (not all elements found):

    df = pd.DataFrame({'mycol': list('dxcabcodsdg')})
    elements = list('abcz')
    
    idxall(df.mycol, elements)
    # nan
    

We can use np.unique with return_index=True to find the first occurrence of each unique value:

import numpy as np
import pandas as pd

elements = ['a', 'b', 'c', 'd']
df = pd.DataFrame({
    'mycol': ['a', 'x', 'y', 'e', 'b', 'c', 'o', 'l', 's', 'd', 'g']
})

# Find the first location where each unique value is found
a, b = np.unique(df['mycol'], return_index=True)
# Compare unique values to values we're looking for
m = (a == np.array(elements)[:, None])
# If we have a location for all elements
if m.any(axis=1).all():
    # Find the highest index value
    max_index = b[m.any(axis=0)].max()
    # Offset index by one to match expected output
    print('All values found by', max_index + 1)
else:
    # We couldn't find all elements
    print('Not all elements found.')
All values found by 10

Example with mixed order and duplicates:

elements = ['a', 'b', 'c', 'd']
df = pd.DataFrame({
    'mycol': ['d', 'x', 'c', 'a', 'b', 'c', 'o', 'd', 's', 'd', 'g']
})
   mycol
0      d
1      x
2      c
3      a
4      b
5      c
6      o
7      d
8      s
9      d
10     g
All values found by 5

Example where not all elements are found:

elements = ['a', 'b', 'c', 'z']
df = pd.DataFrame({
    'mycol': ['d', 'x', 'c', 'a', 'b', 'c', 'o', 'd', 's', 'd', 'g']
})
   mycol
0      d
1      x
2      c
3      a
4      b
5      c
6      o
7      d
8      s
9      d
10     g
Not all elements found.  # (No z)
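
One way to package the np.unique idea as a reusable function is sketched below. The name rows_needed_np_unique is hypothetical, and np.isin is swapped in for the broadcast comparison used above. The sketch assumes elements is non-empty and that the column's values can be sorted against each other (np.unique sorts them) and contain no missing values.

```python
import numpy as np
import pandas as pd


def rows_needed_np_unique(series, elements):
    """Rows to scan before every element has appeared, or nan if one is missing."""
    # uniques: sorted distinct values; first_pos: position of each first occurrence
    uniques, first_pos = np.unique(series.to_numpy(), return_index=True)
    wanted = np.isin(uniques, elements)  # which distinct values are being searched for
    if wanted.sum() < len(set(elements)):
        return np.nan  # at least one element never occurs in the column
    return int(first_pos[wanted].max()) + 1  # latest first occurrence, as a row count


mixed = pd.Series(['d', 'x', 'c', 'a', 'b', 'c', 'o', 'd', 's', 'd', 'g'])
print(rows_needed_np_unique(mixed, ['a', 'b', 'c', 'd']))  # 5
print(rows_needed_np_unique(mixed, ['a', 'b', 'c', 'z']))  # nan
```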

Try idxmax:

>>> df['mycol'].isin(elements)[::-1].idxmax()
9
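
Note what this one-liner computes: reversing the boolean mask and taking idxmax returns the index label of the last row whose value is in elements. That coincides with the wanted row in the question's data only because d occurs once, near the end. A small sketch on the mixed-order sample from this thread (assuming the default RangeIndex) illustrates the difference:

```python
import pandas as pd

elements = ['a', 'b', 'c', 'd']
df = pd.DataFrame({'mycol': list('dxcabcodsdg')})

# Label of the LAST row whose value is in `elements` (the trailing 'd').
last_match = df['mycol'].isin(elements)[::-1].idxmax()
print(last_match)  # 9

# Label of the row where the final missing element first appears ('b').
first_seen = df['mycol'].drop_duplicates()
all_found_at = first_seen[first_seen.isin(elements)].index.max()
print(all_found_at)  # 4
```

The second computation is for illustration only: it does not check that every element is actually present.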

Edit:

To also check that every value of elements is present in the DataFrame, try:

x = df['mycol'].drop_duplicates().isin(elements).cumsum().eq(len(elements))
if x.any():
    print(x.idxmax())
else:
    print("Not all values are in the dataframe")

For the current DataFrame:

9

For a DataFrame that does not contain all the values:

Not all values are in the dataframe
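
The snippet prints the index label (9), whereas the question counts rows (10). A function that returns the row count instead might look like the sketch below. The name rows_needed_cumsum is hypothetical, and the sketch assumes elements is non-empty; the index is reset so that the result is a position even when the DataFrame does not use the default index.

```python
import numpy as np
import pandas as pd


def rows_needed_cumsum(series, elements):
    """Row count at which all of `elements` have been seen, or nan."""
    s = series.reset_index(drop=True)  # positions 0..n-1, whatever the original index
    first_seen = s.drop_duplicates()  # keep only the first occurrence of each value
    # Running count of distinct wanted values seen so far.
    found_all = first_seen.isin(elements).cumsum().eq(len(set(elements)))
    if not found_all.any():
        return np.nan
    return int(found_all.idxmax()) + 1  # first position where the count is complete


df = pd.DataFrame({'mycol': list('axyebcolsdg')})
print(rows_needed_cumsum(df['mycol'], ['a', 'b', 'c', 'd']))  # 10
print(rows_needed_cumsum(df['mycol'], ['a', 'b', 'c', 'z']))  # nan
```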
