Counting occurrences of an item in one DataFrame column based on unique values in another column

Published 2024-06-26 03:14:42


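I have a dataset of people emailing potential customers, as well as people emailing each other, with a timestamp and an email ID for each message. I want to summarize it into a DataFrame showing how many emails each person sent and how many each person received. df_in below is a mock input dataset, and df_out is the output I want (sorted by the highest send count, then by the highest receive count). I tried groupby with size in three different ways (df1, df2 and df3), but I cannot even get the send counts right (as in df_out). How can I do this? The Python code is below.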

import pandas as pd

df_in = pd.DataFrame({
'sender':['Able Boy','Able Boy','Able Boy','Mark L. Taylor','Mark L. Taylor','Mark L. Taylor','scott kirk','scott kirk','scott kirk','scott kirk'],
'receiver':['Toni Z. Zapata','Mark Angel','Johnny C. Cash','paul a boyd','michelle fam','debbie bradford','Mark Angel','Johnny C. Cash','Able Boy','Mark L. Taylor'],
'timeContact':[911929000000,911929000000,910228000000,911497000000,911497000000,911932000000,914261000000,914267000000,914269000000,914276000000],
'email_ID':['<A34E5R>','<A34E5R>','<B34E5R>','<C34E5R>','<C34E5R>','<C36E5R>','<C36E5A>','<C36E5B>','<C36E5C>','<C36E5D>']
})

print("\ndf_in is:")
print(df_in)

df_out = pd.DataFrame({
'person':['scott kirk','Able Boy','Mark L. Taylor','Mark Angel','Toni Z. Zapata','Johnny C. Cash','paul a boyd','michelle fam','debbie bradford'],
'number_send':[4,2,2,0,0,0,0,0,0],
'number_received':[0,2,1,2,1,1,1,1,1]
})

print()
print("\ndf_out is:")
print(df_out)

df1 = df_in.groupby(['email_ID','sender']).size()
print()
print("\ndf1 is:")
print(df1)

df2 = df_in.groupby(['sender']).size()
print()
print("\ndf2 is:")
print(df2)

df3 = df_in.groupby(['sender','email_ID']).size()
print()
print("\ndf3 is:")
print(df3)

3 Answers

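You can create the person column from the unique values of the sender and receiver columns. Then map this column with the value_counts of sender and of receiver. Finally, apply fillna and sort_values on the two count columns with ascending=False.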

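import numpy as np  # pd.np was removed in pandas 2.0, so import NumPy directly

# Every person who appears as a sender or a receiver (np.unique returns them sorted)
df_out = pd.DataFrame({'person': np.unique(df_in[['sender','receiver']].values.flatten())})
# Emails sent: count each (sender, email_ID) pair only once
df_out['number_send'] = df_out.person.map(df_in.drop_duplicates(subset=['sender','email_ID'])
                                               .sender.value_counts())
# Emails received: one row per receiver
df_out['number_received'] = df_out.person.map(df_in.receiver.value_counts())
df_out = df_out.fillna(0).sort_values(by=['number_send', 'number_received'], ascending=False)\
               .reset_index(drop=True)
print(df_out)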
            person  number_send  number_received
0       scott kirk          4.0              0.0
1         Able Boy          2.0              1.0
2   Mark L. Taylor          2.0              1.0
3   Johnny C. Cash          0.0              2.0
4       Mark Angel          0.0              2.0
5   Toni Z. Zapata          0.0              1.0
6  debbie bradford          0.0              1.0
7     michelle fam          0.0              1.0
8      paul a boyd          0.0              1.0
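The drop_duplicates step matters because one email sent to two receivers occupies two rows. The following is a minimal sketch of that difference, assuming the same mock data as df_in in the question (only the sender and email_ID columns are reproduced); the names rows_per_sender and emails_per_sender are illustrative and not part of the original answer.

```python
import pandas as pd

# Same mock data as df_in in the question, reduced to the two columns needed here.
df_in = pd.DataFrame({
    'sender': ['Able Boy', 'Able Boy', 'Able Boy',
               'Mark L. Taylor', 'Mark L. Taylor', 'Mark L. Taylor',
               'scott kirk', 'scott kirk', 'scott kirk', 'scott kirk'],
    'email_ID': ['<A34E5R>', '<A34E5R>', '<B34E5R>',
                 '<C34E5R>', '<C34E5R>', '<C36E5R>',
                 '<C36E5A>', '<C36E5B>', '<C36E5C>', '<C36E5D>'],
})

# Counting rows: an email sent to two receivers is counted twice.
rows_per_sender = df_in['sender'].value_counts()

# Counting emails: keep one row per (sender, email_ID) pair first.
emails_per_sender = (df_in.drop_duplicates(subset=['sender', 'email_ID'])['sender']
                     .value_counts())

print(rows_per_sender['Able Boy'])    # 3 rows
print(emails_per_sender['Able Boy'])  # 2 distinct emails
```

This is also why the plain groupby(['sender']).size() attempt in the question overcounts for Able Boy and Mark L. Taylor.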

Using melt (edited to count the email_ID column values correctly)
Melt sender and receiver, then groupby on the result and use nunique. Next, unstack and sum over index level=1.

df1 = df_in.melt(id_vars='email_ID', value_vars=['sender', 'receiver'])
df_new = (df1.groupby([*df1.columns], sort=False)
             .email_ID.nunique().unstack(1)
             .groupby(level=1, sort=False).sum())  # replaces .sum(level=1), removed in pandas 2.0

Out[250]:
variable         sender  receiver
value
Able Boy            2.0       1.0
Toni Z. Zapata      0.0       1.0
Mark Angel          0.0       2.0
Johnny C. Cash      0.0       2.0
Mark L. Taylor      2.0       1.0
paul a boyd         0.0       1.0
michelle fam        0.0       1.0
debbie bradford     0.0       1.0
scott kirk          4.0       0.0
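The same melt idea can also be written with pivot_table, which may be easier to read than the groupby/unstack chain. This is a hedged sketch and not the answerer's code: it assumes the mock data from the question, and the names long_df, role and counts are made up for illustration.

```python
import pandas as pd

# Mock data from the question (timeContact omitted because it is not needed here).
df_in = pd.DataFrame({
    'sender': ['Able Boy', 'Able Boy', 'Able Boy',
               'Mark L. Taylor', 'Mark L. Taylor', 'Mark L. Taylor',
               'scott kirk', 'scott kirk', 'scott kirk', 'scott kirk'],
    'receiver': ['Toni Z. Zapata', 'Mark Angel', 'Johnny C. Cash',
                 'paul a boyd', 'michelle fam', 'debbie bradford',
                 'Mark Angel', 'Johnny C. Cash', 'Able Boy', 'Mark L. Taylor'],
    'email_ID': ['<A34E5R>', '<A34E5R>', '<B34E5R>',
                 '<C34E5R>', '<C34E5R>', '<C36E5R>',
                 '<C36E5A>', '<C36E5B>', '<C36E5C>', '<C36E5D>'],
})

# Long format: one row per (email_ID, role, person).
long_df = df_in.melt(id_vars='email_ID', value_vars=['sender', 'receiver'],
                     var_name='role', value_name='person')

# Distinct email IDs per person and role; combinations that never occur become 0.
counts = long_df.pivot_table(index='person', columns='role', values='email_ID',
                             aggfunc='nunique', fill_value=0)
counts = counts[['sender', 'receiver']]  # put sender first, as in the output above
print(counts)
```

Unlike the groupby version, the rows here come out sorted by person name rather than in order of first appearance.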

What you want is nunique, not count or size:

(pd.merge(df_in.groupby('sender').email_ID.nunique(),  # count email sent by ID
         df_in.groupby('receiver').email_ID.nunique(), # count email received by ID
         left_index=True,                              # merge on sender  
         right_index=True,                             # merge on receiver
         how='outer')
 .fillna(0)                                            # replace missing values (NaN) with 0
 .rename(columns={'email_ID_x':'number_send',          # rename columns as needed
                  'email_ID_y':'number_received'})
)

Output:

                 number_send  number_received
Able Boy                 2.0              1.0
Johnny C. Cash           0.0              2.0
Mark Angel               0.0              2.0
Mark L. Taylor           2.0              1.0
Toni Z. Zapata           0.0              1.0
debbie bradford          0.0              1.0
michelle fam             0.0              1.0
paul a boyd              0.0              1.0
scott kirk               4.0              0.0
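To get closer to the layout asked for in the question (a person column, integer counts, sorted by sent and then by received), the nunique result can be post-processed. The sketch below is one possible way, assuming the mock data from the question; it uses pd.concat instead of pd.merge so the columns can be named up front, which is a substitution and not the answerer's approach.

```python
import pandas as pd

# Mock data from the question (timeContact omitted because it is not needed here).
df_in = pd.DataFrame({
    'sender': ['Able Boy', 'Able Boy', 'Able Boy',
               'Mark L. Taylor', 'Mark L. Taylor', 'Mark L. Taylor',
               'scott kirk', 'scott kirk', 'scott kirk', 'scott kirk'],
    'receiver': ['Toni Z. Zapata', 'Mark Angel', 'Johnny C. Cash',
                 'paul a boyd', 'michelle fam', 'debbie bradford',
                 'Mark Angel', 'Johnny C. Cash', 'Able Boy', 'Mark L. Taylor'],
    'email_ID': ['<A34E5R>', '<A34E5R>', '<B34E5R>',
                 '<C34E5R>', '<C34E5R>', '<C36E5R>',
                 '<C36E5A>', '<C36E5B>', '<C36E5C>', '<C36E5D>'],
})

# Distinct email IDs per sender and per receiver, named as the final columns.
sent = df_in.groupby('sender')['email_ID'].nunique().rename('number_send')
received = df_in.groupby('receiver')['email_ID'].nunique().rename('number_received')

df_out = (pd.concat([sent, received], axis=1)   # outer-aligns on the person names
            .fillna(0)                          # people who never sent or never received
            .astype(int)                        # counts as integers instead of floats
            .sort_values(['number_send', 'number_received'], ascending=False)
            .rename_axis('person')
            .reset_index())
print(df_out)
```

With this data the first row is scott kirk with 4 sent and 0 received; the relative order of people with identical counts is not guaranteed.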
