基于更复杂的条件删除pandas中的行

2024-06-24 12:41:49 发布

您现在位置:Python中文网/ 问答频道 /正文

我有以下数据框:

time        id  type
2012-12-19  1   abcF1
2013-11-02  1   xF1yz
2012-12-19  1   abcF1
2012-12-18  1   abcF1
2013-11-02  1   xF1yz
2006-07-07  5   F5spo
2006-07-06  5   F5spo
2005-07-07  5   F5abc

对于给定的id,我需要找到最长日期。你知道吗

对于那个最长日期,我需要检查类型。你知道吗

如果类型与最大日期的类型不同,我必须删除给定id的每一行。你知道吗

目标数据帧示例:

time        id  type
<deleted because for id 1 the date is not the max value and the type differs from the type of the max date for id 1>
2013-11-02  1   xF1yz
<deleted because for id 1 the date is not the max value and the type differs from the type of the max date for id 1>
<deleted because for id 1 the date is not the max value and the type differs from the type of the max date for id 1>
2013-11-02  1   xF1yz
2006-07-07  5   F5spo
2006-07-06  5   F5spo //kept because although the date is not max, it has the same type as the row with the max date for id 5
<deleted because for id 5 the date is not the max value and the type differs from the type of the max date for id 5>

我怎样才能做到这一点? 我刚接触熊猫,正在努力学习如何正确使用图书馆。你知道吗


Tags: andtheidfordateisvaluetype
3条回答

使用^{}获取最大值的索引,只过滤列idtype^{}

df = df.merge(df.loc[df.groupby('id')['time'].idxmax(), ['id','type']])
print (df)
        time  id   type
0 2013-11-02   1  xF1yz
1 2013-11-02   1  xF1yz
2 2006-07-07   5  F5spo
3 2006-07-06   5  F5spo

或将^{}^{}一起使用:

df = df.merge(df.sort_values('time').drop_duplicates('id', keep='last')[["id", "type"]])

使用^{}^{}^{}^{}创建助手Series。然后使用^{}

# If neccessary cast to datetime dtype
# df['time'] = pd.to_datetime(df['time'])

s = df.set_index('type').groupby('id')['time'].transform('idxmax')
df[df.type == s.values]

[输出]

        time  id   type
1 2013-11-02   1  xF1yz
4 2013-11-02   1  xF1yz
5 2006-07-07   5  F5spo
6 2006-07-06   5  F5spo

您可以按时间对数据帧排序,然后按id分组,并选择每个分组中的最后一行。这是日期最大的一行。你知道吗

last_rows = df.sort_values('time').groupby('id').last()

然后将原始数据帧与新数据帧合并:

result = df.merge(last_rows, on=["id", "type"])
#       time_x  id   type      time_y
#0  2013-11-02   1  xF1yz  2013-11-02
#1  2013-11-02   1  xF1yz  2013-11-02
#2  2006-07-07   5  F5spo  2006-07-07
#3  2006-07-06   5  F5spo  2006-07-07

如果需要,请删除最后一个重复列:

result.drop('time_y', axis=1, inplace=True)

相关问题 更多 >