Find customer IDs that did not exist before

Posted 2024-09-28 21:56:38


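I want to filter for the customer_id values that did not exist in the earlier data, so that every new_customer_id is one that first appears on 2020-01-10 and did not exist on 2020-01-01.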

Input

date          customer_id   amount_spent 
2020-01-01    24            123
2020-01-10    24            145
2020-01-01    58             89
2020-01-10    58             67
2020-01-01    98             34
2020-01-10    99             86
2020-01-10    67            140
2020-01-10    32            321
2020-01-10    75             76

Expected output

new_customer_id  amount_spent
32               321
75                76
67               140

I tried using the shift function in Pandas, but that did not work for me.

Edit

import pandas as pd

df = pd.DataFrame([["2020-01-01",24,123],
["2020-01-10",24,145],
["2020-01-01",58,89],
["2020-01-10",58,67],
["2020-01-01",98,34],
["2020-01-10",98,86],
["2020-01-10",67,140],
["2020-01-10",32,321],
["2020-01-10",75,76]], columns=["date","customer_id","amount_spent"])

3 answers

IIUC, you can collect the customer_id values that are present on 2020-01-01 and then filter them out:

s = df.loc[df["date"]=="2020-01-01", "customer_id"]

print (df[~df["customer_id"].isin(s)])

         date  customer_id  amount_spent
5  2020-01-10           99            86
6  2020-01-10           67           140
7  2020-01-10           32           321
8  2020-01-10           75            76
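
For reference, the same isin idea can be sketched as a self-contained script that also produces the column layout from the expected output. This is only a sketch: the helper name new_customers and its parameters are made up for illustration, the data is the DataFrame from the question's edit (where customer 98 appears on both dates), and the extra date filter is an assumption that guards against frames containing more than two dates.

```python
import pandas as pd

# Data from the question's edit (customer 98 appears on both dates).
df = pd.DataFrame(
    [
        ["2020-01-01", 24, 123],
        ["2020-01-10", 24, 145],
        ["2020-01-01", 58, 89],
        ["2020-01-10", 58, 67],
        ["2020-01-01", 98, 34],
        ["2020-01-10", 98, 86],
        ["2020-01-10", 67, 140],
        ["2020-01-10", 32, 321],
        ["2020-01-10", 75, 76],
    ],
    columns=["date", "customer_id", "amount_spent"],
)


def new_customers(frame, old_date, new_date):
    """Hypothetical helper: rows on new_date whose customer_id is absent on old_date."""
    # IDs that were already present on the earlier date.
    old_ids = frame.loc[frame["date"] == old_date, "customer_id"]
    # Keep rows from the later date whose ID is not among the earlier IDs.
    is_new = (frame["date"] == new_date) & ~frame["customer_id"].isin(old_ids)
    return (
        frame.loc[is_new, ["customer_id", "amount_spent"]]
        .rename(columns={"customer_id": "new_customer_id"})
        .reset_index(drop=True)
    )


result = new_customers(df, "2020-01-01", "2020-01-10")
print(result)
```

With the edit's data this keeps customers 67, 32 and 75, matching the expected output in the question apart from row order.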

Here is another solution:

import numpy as np

mask = df.groupby('customer_id').transform(np.size).eq(1)

    date  amount_spent
0  False         False
1  False         False
2  False         False
3  False         False
4   True          True
5   True          True
6   True          True
7   True          True
8   True          True

df[mask['date'] & df.date.eq('2020-01-10')]

         date  customer_id  amount_spent
5  2020-01-10           99            86
6  2020-01-10           67           140
7  2020-01-10           32           321
8  2020-01-10           75            76
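
A possible variant of this counting approach, shown only as a sketch: transforming a single column with the built-in "size" aggregation gives a one-column mask and avoids the numpy import. The data below is the table from the question (with customer 99 on 2020-01-10), since that is what the printed output above reflects. The date condition is still required, because a customer seen only on 2020-01-01 also has a group size of 1.

```python
import pandas as pd

# Table from the question body (row 5 has customer 99, not 98).
df = pd.DataFrame(
    {
        "date": [
            "2020-01-01", "2020-01-10", "2020-01-01", "2020-01-10",
            "2020-01-01", "2020-01-10", "2020-01-10", "2020-01-10",
            "2020-01-10",
        ],
        "customer_id": [24, 24, 58, 58, 98, 99, 67, 32, 75],
        "amount_spent": [123, 145, 89, 67, 34, 86, 140, 321, 76],
    }
)

# Number of rows per customer, broadcast back onto every row.
counts = df.groupby("customer_id")["customer_id"].transform("size")

# A customer is treated as new if it appears exactly once and that row is on the later date.
seen_once = counts.eq(1)
result = df[seen_once & df["date"].eq("2020-01-10")]
print(result)
```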

Assuming there is a typo in your example (99 should be 98), you can do the following:

df = pd.DataFrame([["2020-01-01",24,123],
["2020-01-10",24,145],
["2020-01-01",58,89],
["2020-01-10",58,67],
["2020-01-01",98,34],
["2020-01-10",98,86],
["2020-01-10",67,140],
["2020-01-10",32,321],
["2020-01-10",75,76]],columns = ["date","customer_id","amount_spent" ])

df["order"] = df.groupby("customer_id").cumcount()

df[(df["date"] == "2020-01-10") & (df["order"]==0)]

Output:

         date  customer_id  amount_spent  order
6  2020-01-10           67           140      0
7  2020-01-10           32           321      0
8  2020-01-10           75            76      0

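This will need to be adapted depending on how complex your df is. Note that cumcount follows row order, so it relies on each customer's rows being sorted by date.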
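
One way such an adaptation might look, as a sketch only: compute each customer's earliest date and compare it with the target date. This swaps cumcount for a grouped minimum, so it does not depend on row order and handles more than two dates. The data here is hypothetical (three dates, rows deliberately out of order) and is not from the question.

```python
import pandas as pd

# Hypothetical data for illustration: three dates, rows not in date order.
df = pd.DataFrame(
    [
        ["2020-01-10", 24, 145],
        ["2020-01-01", 24, 123],
        ["2020-01-20", 24, 50],
        ["2020-01-10", 67, 140],
        ["2020-01-20", 67, 10],
        ["2020-01-20", 81, 15],
    ],
    columns=["date", "customer_id", "amount_spent"],
)

# Real datetimes make the comparison independent of string formatting.
df["date"] = pd.to_datetime(df["date"])

# Earliest date on which each customer appears, repeated on every row of that customer.
first_seen = df.groupby("customer_id")["date"].transform("min")

# Customers are new on the target date if that is also their first appearance.
target = pd.Timestamp("2020-01-10")
new_on_target = df[(df["date"] == target) & (first_seen == target)]
print(new_on_target)
```

In this made-up frame only customer 67 qualifies: customer 24 was already present on 2020-01-01, and customer 81 first appears later, on 2020-01-20.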
