Groupby customer and shop to get the average transaction frequency. Date problem

Posted 2024-06-30 16:04:39

I have the following transaction records. The dates are in year/month/day format.

print(df)

   customer_id       shop date_of_transaction
0         John  McDonalds          2020-02-03
1         John  McDonalds          2020-02-04
2         John  McDonalds          2020-02-05
3         John        KFC          2020-02-06
4         John        KFC          2020-02-07
5         John        KFC          2020-02-08
6         Mary  McDonalds          2020-02-09
7         Mary  McDonalds          2020-02-10
8         Mary  McDonalds          2020-02-11
9         Mary        KFC          2020-02-12
10        Mary        KFC          2020-02-13
11         Joe        KFC          2020-02-14
12         Joe  McDonalds          2020-02-15
13         Joe  McDonalds          2020-02-16
14         Joe  McDonalds          2020-02-17
15         Joe        KFC          2020-02-18
16         Joe        KFC          2020-02-19
17         Joe        KFC          2020-02-20
18         Joe  McDonalds          2020-02-21

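I want to get each customer's average transaction frequency for each shop.

For example, Joe went to McDonalds 4 times between February 15 and February 21. That is 6 days between his first and last transaction, so he goes to McDonalds once every 1.5 days.

I want to build a new dataframe from this information, so I tried this: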

df.groupby(['customer_id','shop'])['date_of_transaction'].apply(lambda  x: (max(x) - min (x))/len(x))

customer_id  shop     
Joe          KFC         1 days 12:00:00
             McDonalds   1 days 12:00:00
John         KFC         0 days 16:00:00
             McDonalds   0 days 16:00:00
Mary         KFC         0 days 12:00:00
             McDonalds   0 days 16:00:00

Joe's average frequency at McDonalds is shown as 1 days. It should be 1.5 days.

If I remove the division, I get:

df.groupby(['customer_id','shop'])['date_of_transaction'].apply(lambda  x:(max(x) - min (x)))

customer_id  shop     
Joe          KFC         6 days
             McDonalds   6 days
John         KFC         2 days
             McDonalds   2 days
Mary         KFC         1 days
             McDonalds   2 days

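It only goes wrong when I try to divide by the number of visits each person made to each shop.

I tried adding astype(int) to (max(x) - min (x)), but it had no effect. I know the problem is with the timedelta object, but I can't convert it to an int. I also tried adding .dt.days to the timedelta object, with no luck.

Ideally, I would like to end up with a dataframe like this (note: the frequency numbers here are made up):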

  customer_id  McDonalds Frequency  KFC Frequency
0        John                    1              2
1        Mary                    3              4
2         Joe                    5              6

My practice df is below. If you load the df, the dates can be converted using dayfirst:

df['date_of_transaction'] = pd.to_datetime(df['date_of_transaction'],dayfirst=True)


df.to_dict()

{'customer_id': {0: 'John', 1: 'John', 2: 'John', 3: 'John', 4: 'John', 5: 'John', 6: 'Mary', 7: 'Mary', 8: 'Mary', 9: 'Mary', 10: 'Mary', 11: 'Joe', 12: 'Joe', 13: 'Joe', 14: 'Joe', 15: 'Joe', 16: 'Joe', 17: 'Joe', 18: 'Joe'}, 'shop': {0: 'McDonalds', 1: 'McDonalds', 2: 'McDonalds', 3: 'KFC', 4: 'KFC', 5: 'KFC', 6: 'McDonalds', 7: 'McDonalds', 8: 'McDonalds', 9: 'KFC', 10: 'KFC', 11: 'KFC', 12: 'McDonalds', 13: 'McDonalds', 14: 'McDonalds', 15: 'KFC', 16: 'KFC', 17: 'KFC', 18: 'McDonalds'}, 'date_of_transaction': {0: Timestamp('2020-02-03 00:00:00'), 1: Timestamp('2020-02-04 00:00:00'), 2: Timestamp('2020-02-05 00:00:00'), 3: Timestamp('2020-02-06 00:00:00'), 4: Timestamp('2020-02-07 00:00:00'), 5: Timestamp('2020-02-08 00:00:00'), 6: Timestamp('2020-02-09 00:00:00'), 7: Timestamp('2020-02-10 00:00:00'), 8: Timestamp('2020-02-11 00:00:00'), 9: Timestamp('2020-02-12 00:00:00'), 10: Timestamp('2020-02-13 00:00:00'), 11: Timestamp('2020-02-14 00:00:00'), 12: Timestamp('2020-02-15 00:00:00'), 13: Timestamp('2020-02-16 00:00:00'), 14: Timestamp('2020-02-17 00:00:00'), 15: Timestamp('2020-02-18 00:00:00'), 16: Timestamp('2020-02-19 00:00:00'), 17: Timestamp('2020-02-20 00:00:00'), 18: Timestamp('2020-02-21 00:00:00')}}

2 answers

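The problem here is that you are reading the returned value as a date when it is actually a timedelta: 1 days 12:00:00 is 1.5 days, that is, one day and 12 hours have passed.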
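To make that concrete, here is a minimal sketch (not part of the original answer) that reproduces Joe's McDonalds figure and turns the timedelta into a plain number of days. The variable names are illustrative only.

```python
import numpy as np
import pandas as pd

# Joe's McDonalds visits: 6 days between first and last, 4 visits
span = pd.Timestamp("2020-02-21") - pd.Timestamp("2020-02-15")
per_visit = span / 4          # still a Timedelta: 1 days 12:00:00

# Dividing by a one-day unit yields a float number of days
days_pd = per_visit / pd.Timedelta(days=1)
days_np = per_visit / np.timedelta64(1, "D")

print(per_visit)   # 1 days 12:00:00
print(days_pd)     # 1.5
print(days_np)     # 1.5
```

Either divisor gives the same result, so the choice between pd.Timedelta and np.timedelta64 is a matter of taste.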

Let's reshape the data a little with an aggregated groupby.

Then we can use np.timedelta64 to convert your timedelta to a number of days.

df1 = df.groupby(['customer_id','shop']).agg(mind=('date_of_transaction','min'),
                                      maxd=('date_of_transaction','max'),
                                      no_visits=('customer_id','count')).reset_index(0)


print(df1)

          customer_id       mind       maxd  no_visits
shop
KFC               Joe 2020-02-14 2020-02-20          4
McDonalds         Joe 2020-02-15 2020-02-21          4
KFC              John 2020-02-06 2020-02-08          3
McDonalds        John 2020-02-03 2020-02-05          3
KFC              Mary 2020-02-12 2020-02-13          2
McDonalds        Mary 2020-02-09 2020-02-11          3

df1['timedelta'] = ((df1['maxd'] - df1['mind']) / df1['no_visits']) / np.timedelta64(1,'D')

          customer_id       mind       maxd  no_visits  timedelta
shop
KFC               Joe 2020-02-14 2020-02-20          4   1.500000
McDonalds         Joe 2020-02-15 2020-02-21          4   1.500000
KFC              John 2020-02-06 2020-02-08          3   0.666667
McDonalds        John 2020-02-03 2020-02-05          3   0.666667
KFC              Mary 2020-02-12 2020-02-13          2   0.500000
McDonalds        Mary 2020-02-09 2020-02-11          3   0.666667

Then we use a crosstab:

df2 = (
    pd.crosstab(df1["customer_id"], df1.index, df1["timedelta"], aggfunc="first")
    .add_suffix("_visits")
    .reset_index(0)
)

print(df2)

col_0 customer_id  KFC_visits  McDonalds_visits
0             Joe    1.500000          1.500000
1            John    0.666667          0.666667
2            Mary    0.500000          0.666667

Or a wonderful one-liner from our resident guru, Scott Boston:

df.groupby(["customer_id", "shop"])["date_of_transaction"].agg(
    lambda x: (np.ptp(x) / np.timedelta64(1, "D")) / x.count()
).unstack(1).add_suffix('_visits')


shop         KFC_visits  McDonalds_visits
customer_id                              
Joe            1.500000          1.500000
John           0.666667          0.666667
Mary           0.500000          0.666667
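For anyone who wants to run the whole thing end to end, the following is a self-contained sketch that rebuilds the sample data from the question and produces the wide table with the column names the question asked for. It is one possible arrangement, not the method of either answer: it uses plain min/max/count aggregation followed by unstack, and the names visits, stats, freq and wide are made up for illustration.

```python
import pandas as pd

# Sample data rebuilt from the question (dates are ISO year-month-day)
visits = {
    ("John", "McDonalds"): ["2020-02-03", "2020-02-04", "2020-02-05"],
    ("John", "KFC"): ["2020-02-06", "2020-02-07", "2020-02-08"],
    ("Mary", "McDonalds"): ["2020-02-09", "2020-02-10", "2020-02-11"],
    ("Mary", "KFC"): ["2020-02-12", "2020-02-13"],
    ("Joe", "KFC"): ["2020-02-14", "2020-02-18", "2020-02-19", "2020-02-20"],
    ("Joe", "McDonalds"): ["2020-02-15", "2020-02-16", "2020-02-17", "2020-02-21"],
}
rows = [(c, s, d) for (c, s), dates in visits.items() for d in dates]
df = pd.DataFrame(rows, columns=["customer_id", "shop", "date_of_transaction"])
df["date_of_transaction"] = pd.to_datetime(df["date_of_transaction"])

# Per (customer, shop): first visit, last visit, and number of visits
stats = df.groupby(["customer_id", "shop"])["date_of_transaction"].agg(
    ["min", "max", "count"]
)

# (last - first) / visits, expressed as a float number of days
freq = (stats["max"] - stats["min"]) / stats["count"] / pd.Timedelta(days=1)

# One row per customer, one column per shop
wide = freq.unstack("shop").add_suffix(" Frequency").reset_index()
wide.columns.name = None
print(wide.round(3))
```

Keeping the three aggregates in a separate frame makes it easy to inspect the first date, last date and visit count before dividing.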

I was trying to convert the Timedelta object with .dt.days. Using .days instead works.

So:


grouped = df.groupby(['customer_id','shop'])['date_of_transaction'].apply(lambda  x: ((max(x) - min (x)).days)/len(x))


customer_id  shop     
Joe          KFC          1.500000
             McDonalds    1.500000
John         KFC          0.666667
             McDonalds    0.666667
Mary         KFC          0.500000
             McDonalds    0.666667
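The reason .dt.days failed while .days worked is that, inside the lambda, max(x) - min(x) is a single Timedelta rather than a Series, and .dt is a Series accessor. The sketch below illustrates the distinction with made-up dates. It also shows that .days keeps only the whole-day component, which is harmless here because every date falls on midnight but would matter if the timestamps carried times of day.

```python
import pandas as pd

dates = pd.Series(pd.to_datetime(["2020-02-15", "2020-02-17", "2020-02-21"]))

# Scalar case: max - min is one Timedelta, which exposes .days directly
span = dates.max() - dates.min()
print(type(span).__name__, span.days)     # Timedelta 6
print(hasattr(span, "dt"))                # False: .dt exists only on Series

# Series case: differences between rows form a Series, so .dt.days applies
gaps = dates.diff().dropna()
print(gaps.dt.days.tolist())              # [2, 4]

# .days drops any sub-day remainder; total_seconds() keeps it
partial = pd.Timedelta("1 days 12:00:00")
print(partial.days)                       # 1
print(partial.total_seconds() / 86400)    # 1.5
```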

To get the desired dataframe shape, I used pd.pivot_table:


pd.pivot_table(grouped.reset_index(), values='date_of_transaction', index="customer_id", columns='shop')

shop              KFC  McDonalds
customer_id                     
Joe          1.500000   1.500000
John         0.666667   0.666667
Mary         0.500000   0.666667

Another solution is to apply the aggregation function directly in the pivot table, without using groupby first:

agg_func = lambda x: ((max(x) - min(x)).days) / len(x)
df.pivot_table(values='date_of_transaction', index='customer_id', columns='shop', aggfunc=agg_func, fill_value=0)
