Python:基于包含多个列值的多个列表筛选行

2024-10-01 19:24:08 发布

您现在位置:Python中文网/ 问答频道 /正文

我有一个大约550万行和13列的大型数据框架。我想根据两列的值筛选该表:“product\u id”和“return\u reason”

如果product_id和return_reason都有重复的值,我想选择整行

例如,假设我有以下df:

Index  product_id  return_reason  col3      col4      col5
1      A0001       Size_too_big   string31  string41  string51
2      A0001       Size_too_big   string32  string42  string52
3      A0001       Defective      string33  string43  string53
4      B0002       Defective      string34  string44  string54
5      B0002       Defective      string35  string45  string55
6      B0002       Changed_mind   string36  string46  string56
7      C0003       Size_small     string37  string47  string57
8      D0004       Size_small     string38  string48  string58
9      D0004       Size_too_big   string39  string49  string59
10     D0004       Size_small     string310 string410 string510

我希望对其进行过滤,以便获得:

Index  product_id  return_reason  col3      col4      col5
1      A0001       Size_too_big   string31  string41  string51
2      A0001       Size_too_big   string32  string42  string52
4      B0002       Defective      string34  string44  string54
5      B0002       Defective      string35  string45  string55
8      D0004       Size_small     string38  string48  string58
10     D0004       Size_small     string310 string410 string510

我已尝试使用groupby函数:

dup_df = df.groupby(['product_id', 'return_reason']).size().to_frame('count').reset_index()
ismorethanNcount = dup_df['count'] > 1
dup_morethanN = dup_df[ismorethanNcount]
print(dup_morethanN)

这导致:

Index  product_id  return_reason  count
1      A0001       Size_too_big   2
4      B0002       Defective      2
8      D0004       Size_small     2

然后,我将新DataFrame中“product_id”和“return_reason”列的值放入列表中:

list_productid = list(dup_morethanN['product_id'])
list_returnreason = list(dup_morethanN['return_reason'])

然后,我使用上述列表作为筛选df的标准:

new_df = df[df['product_id'].isin(list_productid) & df['return_reason'].isin(list_returnreason)]

这使得:

Index  product_id  return_reason  col3      col4      col5
1      A0001       Size_too_big   string31  string41  string51
2      A0001       Size_too_big   string32  string42  string52
3      A0001       Defective      string33  string43  string53
4      B0002       Defective      string34  string44  string54
5      B0002       Defective      string35  string45  string55
8      D0004       Size_small     string38  string48  string58
9      D0004       Size_too_big   string39  string49  string59
10     D0004       Size_small     string310 string410 string510

如您所见,此方法根据我给出的列表独立筛选“产品id”和“退货原因”。我应该怎么做

多谢各位


Tags: iddfsizereturnproductlistsmalltoo
1条回答
网友
1楼 · 发布于 2024-10-01 19:24:08

如果需要所有行都被2列复制,请在^{}中使用^{}keep=False

df = df[df.duplicated(['product_id', 'return_reason'], keep=False)]
print (df)
   Index product_id return_reason       col3       col4       col5
0      1      A0001  Size_too_big   string31   string41   string51
1      2      A0001  Size_too_big   string32   string42   string52
3      4      B0002     Defective   string34   string44   string54
4      5      B0002     Defective   string35   string45   string55
7      8      D0004    Size_small   string38   string48   string58
9     10      D0004    Size_small  string310  string410  string510

或者,如果需要按^{}^{}^{}^{}过滤的2行2列的行:

df = df[df.groupby(['product_id', 'return_reason'])['product_id'].transform('size').eq(2)]
print (df)
   Index product_id return_reason       col3       col4       col5
0      1      A0001  Size_too_big   string31   string41   string51
1      2      A0001  Size_too_big   string32   string42   string52
3      4      B0002     Defective   string34   string44   string54
4      5      B0002     Defective   string35   string45   string55
7      8      D0004    Size_small   string38   string48   string58
9     10      D0004    Size_small  string310  string410  string510

相关问题 更多 >

    热门问题