Filter groups based on a condition on the strings they contain

             A      B   C
0   2002-01-12  Sarah  39
1   2002-01-12   John  17
2   2002-01-12  Susan  30
3   2002-01-15  Danny  12
4   2002-01-15  Peter  25
5   2002-01-15   John  25
6   2002-01-20   John  16
7   2002-01-20   Hung  10
8   2002-02-20   John  20
9   2002-02-20  Susan  40
10  2002-02-24  Rebel  40
11  2002-02-24  Susan  15
12  2002-02-24   Mark  38
13  2002-02-24  Susan  30
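To try the answers below, the sample frame can be rebuilt roughly as follows. This is only a sketch: the values are copied from the table above, and keeping column A as plain strings rather than datetimes is an assumption, since the original dtype is not stated.

```python
import pandas as pd

# Sample data copied from the table above.
# Assumption: column A holds plain strings, not datetime values.
df = pd.DataFrame({
    "A": ["2002-01-12"] * 3 + ["2002-01-15"] * 3 + ["2002-01-20"] * 2
         + ["2002-02-20"] * 2 + ["2002-02-24"] * 4,
    "B": ["Sarah", "John", "Susan", "Danny", "Peter", "John", "John", "Hung",
          "John", "Susan", "Rebel", "Susan", "Mark", "Susan"],
    "C": [39, 17, 30, 12, 25, 25, 16, 10, 20, 40, 40, 15, 38, 30],
})
print(df)
```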

            A      B   C
0  2002-01-12  Sarah  39
1  2002-01-12   John  17
2  2002-01-12  Susan  30
6  2002-01-20   John  16
7  2002-01-20   Hung  10
8  2002-02-20   John  20
9  2002-02-20  Susan  40

3 Answers

User

#1 · edited 2024-09-28 05:27:31

Build an array of dates as the intersection of the dates that contain John and the dates that contain Susan:

import numpy as np

# Dates on which John appears, intersected with dates on which Susan appears
dates = np.intersect1d(
    df.A.values[df.B.values == 'John'],
    df.A.values[df.B.values == 'Susan']
)

Then use the array of dates to filter the DataFrame:

df[df.A.isin(dates)]

# outputs:
            A      B   C
0  2002-01-12  Sarah  39
1  2002-01-12   John  17
2  2002-01-12  Susan  30
8  2002-02-20   John  20
9  2002-02-20  Susan  40
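The same idea can be extended to more than two names. The sketch below is not part of the original answer: the helper name rows_on_dates_with_all is hypothetical, and it assumes at least one name is passed (an empty list would make reduce raise a TypeError).

```python
from functools import reduce

import numpy as np
import pandas as pd


def rows_on_dates_with_all(df, names):
    """Keep rows whose date (column A) has every name in `names` in column B.

    Hypothetical helper for illustration; `names` must be non-empty.
    """
    a = df["A"].to_numpy()
    b = df["B"].to_numpy()
    # One array of dates per required name, then intersect them pairwise.
    per_name_dates = [a[b == name] for name in names]
    dates = reduce(np.intersect1d, per_name_dates)
    return df[df["A"].isin(dates)]


# Sample data from the question (column A as strings is an assumption).
df = pd.DataFrame({
    "A": ["2002-01-12"] * 3 + ["2002-01-15"] * 3 + ["2002-01-20"] * 2
         + ["2002-02-20"] * 2 + ["2002-02-24"] * 4,
    "B": ["Sarah", "John", "Susan", "Danny", "Peter", "John", "John", "Hung",
          "John", "Susan", "Rebel", "Susan", "Mark", "Susan"],
    "C": [39, 17, 30, 12, 25, 25, 16, 10, 20, 40, 40, 15, 38, 30],
})

result = rows_on_dates_with_all(df, ["John", "Susan"])
print(result)
```

With the two names used in the answer, this should select the same rows as the two-array version (index 0, 1, 2, 8, 9).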

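Timings:

Comparing jpp's, ALollz's, and my solutions:

On this sample, the NumPy-based solution is roughly 3.7x faster than jpp's and about 12x faster than ALollz's.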

In [288]: def hal(df):
     ...:     dates = np.intersect1d(
     ...:      df.A.values[df.B.values == 'John'], 
     ...:      df.A.values[df.B.values == 'Susan']
     ...:     )
     ...:     return df[df.A.isin(dates)]
     ...:

In [289]: def jpp(df):
     ...:     s = df.groupby('A')['B'].apply(set)
     ...:     return df[df['A'].map(s) >= {'John', 'Susan'}]
     ...:

In [290]: def alollz(df):
     ...:     flag = df.groupby('A').B.transform(lambda x: ((x=='Susan').any() & (x == 'John').any()).sum().astype('bool'))
     ...:     return df[flag==True]
     ...:

In [291]: %timeit hal(df)
394 µs ± 6.42 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [292]: %timeit jpp(df)
1.46 ms ± 27.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [293]: %timeit alollz(df)
4.9 ms ± 75 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

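However, by dropping some unnecessary extra operations and doing the comparisons on the underlying NumPy arrays, the solution proposed by ALollz can be made about 2x faster.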

In [294]: def alollz_improved(df):
     ...:     v = df.groupby('A').B.transform(lambda x: (x.values=='Susan').any() & (x.values=='John').any())
     ...:     return df[v]
     ...:

In [295]: %timeit alollz_improved(df)
2.2 ms ± 38.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
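The %timeit runs above need IPython. Outside of it, a comparable measurement can be sketched with the standard-library timeit module. The loop count of 100 is an arbitrary choice, and the numbers will differ by machine and pandas version, so none are quoted here.

```python
import timeit

import numpy as np
import pandas as pd

# Sample data from the question (column A as strings is an assumption).
df = pd.DataFrame({
    "A": ["2002-01-12"] * 3 + ["2002-01-15"] * 3 + ["2002-01-20"] * 2
         + ["2002-02-20"] * 2 + ["2002-02-24"] * 4,
    "B": ["Sarah", "John", "Susan", "Danny", "Peter", "John", "John", "Hung",
          "John", "Susan", "Rebel", "Susan", "Mark", "Susan"],
    "C": [39, 17, 30, 12, 25, 25, 16, 10, 20, 40, 40, 15, 38, 30],
})


def hal(df):
    # NumPy intersection of the dates for each name, then an isin filter.
    dates = np.intersect1d(
        df.A.values[df.B.values == 'John'],
        df.A.values[df.B.values == 'Susan'],
    )
    return df[df.A.isin(dates)]


def alollz_improved(df):
    # Per-group flag broadcast back to every row of the group.
    v = df.groupby('A').B.transform(
        lambda x: (x.values == 'Susan').any() & (x.values == 'John').any()
    )
    return df[v]


# Check that both functions select the same rows before comparing speed.
same_rows = hal(df).equals(alollz_improved(df))
print("same result:", same_rows)

n_loops = 100  # arbitrary loop count chosen for this sketch
for func in (hal, alollz_improved):
    seconds = timeit.timeit(lambda: func(df), number=n_loops)
    print(f"{func.__name__}: {seconds / n_loops * 1e6:.0f} µs per call")
```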

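User

#2 · edited 2024-09-28 05:27:31

You can use groupby + transform to create a flag for the groups that satisfy the condition, then use that flag to mask the original df. If you don't want to modify the original df, create a separate Series named flag; otherwise you can assign it to a column of the original df.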

import pandas as pd
# As Haleemur Ali points out, use x.values to make it faster
flag = df.groupby('A').B.transform(lambda x: (x.values == 'Susan').any() & (x.values == 'John').any())

Then you can filter df:

df[flag]
#            A      B   C
#0  2002-01-12  Sarah  39
#1  2002-01-12   John  17
#2  2002-01-12  Susan  30
#8  2002-02-20   John  20
#9  2002-02-20  Susan  40
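The two options mentioned in this answer (a separate flag Series versus a column on the frame) might look like the sketch below. The column name has_both is hypothetical, and the astype(bool) call is a defensive addition rather than something the answer requires.

```python
import pandas as pd

# Sample data from the question (column A as strings is an assumption).
df = pd.DataFrame({
    "A": ["2002-01-12"] * 3 + ["2002-01-15"] * 3 + ["2002-01-20"] * 2
         + ["2002-02-20"] * 2 + ["2002-02-24"] * 4,
    "B": ["Sarah", "John", "Susan", "Danny", "Peter", "John", "John", "Hung",
          "John", "Susan", "Rebel", "Susan", "Mark", "Susan"],
    "C": [39, 17, 30, 12, 25, 25, 16, 10, 20, 40, 40, 15, 38, 30],
})

# True for every row whose date group contains both Susan and John.
flag = df.groupby("A")["B"].transform(
    lambda x: (x.values == "Susan").any() & (x.values == "John").any()
).astype(bool)

# Option 1: keep the flag as a separate Series; df itself is untouched.
filtered = df[flag]

# Option 2: store the flag as a column ("has_both" is a hypothetical name).
# assign returns a new frame, so the original df still has only A, B, C.
df_flagged = df.assign(has_both=flag)
print(df_flagged[df_flagged["has_both"]].drop(columns="has_both"))
```

Keeping the flag as a column can be handy when the same condition is reused in several later steps; the separate Series avoids widening the frame.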

User

#3 · edited 2024-09-28 05:27:31

Create a Series that maps each date to a set of names. Then apply the superset test (set.issuperset) through its syntactic sugar >=:

s = df.groupby('A')['B'].apply(set)

res = df[df['A'].map(s) >= {'John', 'Susan'}]

print(res)

            A      B   C
0  2002-01-12  Sarah  39
1  2002-01-12   John  17
2  2002-01-12  Susan  30
8  2002-02-20   John  20
9  2002-02-20  Susan  40
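For readers unfamiliar with set comparison, the sketch below first shows what >= means for plain Python sets and then applies the test one element at a time. The explicit map with a lambda is a spelling chosen here for clarity, not the form used in the answer; it is intended to select the same rows.

```python
import pandas as pd

required = {"John", "Susan"}

# For sets, a >= b is the superset test: every element of b is also in a.
full_group = {"Sarah", "John", "Susan"} >= required
partial_group = {"John", "Hung"} >= required
print(full_group, partial_group)

# Sample data from the question (column A as strings is an assumption).
df = pd.DataFrame({
    "A": ["2002-01-12"] * 3 + ["2002-01-15"] * 3 + ["2002-01-20"] * 2
         + ["2002-02-20"] * 2 + ["2002-02-24"] * 4,
    "B": ["Sarah", "John", "Susan", "Danny", "Peter", "John", "John", "Hung",
          "John", "Susan", "Rebel", "Susan", "Mark", "Susan"],
    "C": [39, 17, 30, 12, 25, 25, 16, 10, 20, 40, 40, 15, 38, 30],
})

# One set of names per date.
names_by_date = df.groupby("A")["B"].apply(set)

# Look up each row's set of names, then run the superset test per row.
mask = df["A"].map(names_by_date).map(lambda names: names >= required)
print(df[mask])
```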

