Setting some rules for a pandas groupby


I need to set some rules for a pandas groupby. If the ['keep'] column already contains "dup by" before the grouping on datetime, I would like those rows to be ignored.

Here is my code:

import pandas as pd
import numpy as np

df = pd.read_csv("sample.csv",delimiter='|')

df['datetime'] = pd.to_datetime(df['datetime'], errors='coerce')
# per-VIP maximum datetime, taken over ALL rows (including rows already marked "dup by")
most_recent_date = df.groupby(df['VIP_ID'])['datetime'].max()
most_recent_date = most_recent_date.rename("most_recent_date")
df = df.join(most_recent_date, on="VIP_ID")

df['both'] = np.where(
    # NOTE: `dup` is referenced here but never defined in the snippet as posted
    (df['keep'] == 'same tier') & (dup == 'yes'),
    df['VIP_ID'] + df['datetime'].astype(str),
    df['ID']
)
df['keep'] = np.where(
    df['keep'] != 'same tier',df['keep'],
    (np.where(
         df['most_recent_date'] == df['datetime'],
         'yes',
         'dup by ' + df['VIP_ID'].astype(str)))
)



df.loc[df.duplicated(subset=['both'], keep = False),'keep'] = 'same time'
df = df.drop(columns = ['both','most_recent_date'])
print(df)

This code ends up turning every value in the keep column into "dup by".

Sample CSV:

ID|VIP_ID|TIER|datetime|keep
1|F08210020403|GO|2014-05-17 00:00:00|same tier
2|F08210020403|GO|2014-04-18 00:00:00|same tier
3|F08210020403|FO||dup by F08210020403
4|F08210020403|FO||dup by F08210020403
5|F08210020403|FO|2016-09-18 00:00:00|dup by F08210020403
6|F08210020403|FO|2016-05-10 00:00:00|dup by F08210020403
7|F08210020403|FO||dup by F08210020403
8|F08210020403|FO||dup by F08210020403
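
For anyone reproducing this without sample.csv, the same data can be loaded straight from a string. This is only a convenience sketch (the io.StringIO wrapper and the sample variable are my additions; the rows are exactly the sample above):

import io
import pandas as pd

sample = """ID|VIP_ID|TIER|datetime|keep
1|F08210020403|GO|2014-05-17 00:00:00|same tier
2|F08210020403|GO|2014-04-18 00:00:00|same tier
3|F08210020403|FO||dup by F08210020403
4|F08210020403|FO||dup by F08210020403
5|F08210020403|FO|2016-09-18 00:00:00|dup by F08210020403
6|F08210020403|FO|2016-05-10 00:00:00|dup by F08210020403
7|F08210020403|FO||dup by F08210020403
8|F08210020403|FO||dup by F08210020403"""

# same as pd.read_csv("sample.csv", delimiter='|'), but self-contained
df = pd.read_csv(io.StringIO(sample), delimiter='|')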

Because the largest datetime for F08210020403 (2016-09-18 00:00:00) belongs to a row that is already marked "dup by", every row's keep column ends up as "dup by F08210020403". I would like a rule that ignores rows whose keep column already contains "dup", and only then groups by datetime to decide which of the remaining rows gets "yes".

This is my output:

1|F08210020403|GO|2014-05-17 00:00:00|dup by F08210020403
2|F08210020403|GO|2014-04-18 00:00:00|dup by F08210020403
3|F08210020403|FO||dup by F08210020403
4|F08210020403|FO||dup by F08210020403
5|F08210020403|FO|2016-09-18 00:00:00|dup by F08210020403
6|F08210020403|FO|2016-05-10 00:00:00|dup by F08210020403
7|F08210020403|FO||dup by F08210020403
8|F08210020403|FO||dup by F08210020403

Expected output:

1|F08210020403|GO|2014-05-17 00:00:00|yes
2|F08210020403|GO|2014-04-18 00:00:00|dup by F08210020403
3|F08210020403|FO||dup by F08210020403
4|F08210020403|FO||dup by F08210020403
5|F08210020403|FO|2016-09-18 00:00:00|dup by F08210020403
6|F08210020403|FO|2016-05-10 00:00:00|dup by F08210020403
7|F08210020403|FO||dup by F08210020403
8|F08210020403|FO||dup by F08210020403
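
To make the mismatch concrete: the per-VIP maximum over all rows is 2016-09-18 00:00:00, which sits in a row already marked "dup by", while the maximum over only the not-yet-flagged rows is 2014-05-17 00:00:00 (row 1). A small illustrative sketch on a freshly loaded df (the mask variable name is my own):

import pandas as pd

df = pd.read_csv("sample.csv", delimiter='|')
df['datetime'] = pd.to_datetime(df['datetime'], errors='coerce')

# max over every row, including rows already flagged "dup by" -> 2016-09-18
print(df.groupby('VIP_ID')['datetime'].max())

# max over only the rows whose keep does not yet contain "dup" -> 2014-05-17
mask = df['keep'].str.contains('dup')
print(df.loc[~mask].groupby('VIP_ID')['datetime'].max())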

Any help would be greatly appreciated.


1 answer

IIUC:

Try:

c = df['keep'].str.contains('dup by')
# condition that checks whether the 'keep' column already contains 'dup by'
df['datetime'] = pd.to_datetime(df['datetime'], errors='coerce')
most_recent_date = df[~c].groupby('VIP_ID')['datetime'].max()
# rows whose 'keep' contains 'dup by' are excluded from the groupby
df['most_recent_date'] = df['VIP_ID'].map(most_recent_date)
df['both'] = np.where((df['keep'] == 'same tier') & c, df['VIP_ID'] + df['datetime'].astype(str), df['ID'])
# for 'same tier' rows: 'yes' if the row has the most recent (filtered) datetime, otherwise flag it as a duplicate
df['keep'] = np.where(
    df['keep'] != 'same tier',df['keep'],
    (np.where(
         df['most_recent_date'] == df['datetime'],
         'yes',
         'dup by ' + df['VIP_ID'].astype(str)))
)
# rows that share the same 'both' key are labelled 'same time'
df.loc[df.duplicated(subset=['both'], keep=False), 'keep'] = 'same time'
df = df.drop(columns = ['both','most_recent_date'])
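
The same exclusion rule can also be written a bit more compactly. This is only a sketch under the same column assumptions (ID, VIP_ID, datetime, keep), not part of the original answer; the tie-breaking 'same time' step is left out because it never triggers on this sample:

import numpy as np
import pandas as pd

dup = df['keep'].str.contains('dup by')            # rows that should be ignored
df['datetime'] = pd.to_datetime(df['datetime'], errors='coerce')

# per-VIP maximum datetime computed only over the non-"dup by" rows
most_recent = df.loc[~dup].groupby('VIP_ID')['datetime'].max()

same_tier = df['keep'].eq('same tier')
is_latest = df['datetime'].eq(df['VIP_ID'].map(most_recent))
df.loc[same_tier, 'keep'] = np.where(
    is_latest[same_tier], 'yes', 'dup by ' + df.loc[same_tier, 'VIP_ID']
)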
