Pandas equivalent of SELECT COUNT(DISTINCT col1, col2) GROUP BY col3

Posted 2024-10-02 08:29:24


Creating the DataFrame:

import pandas as pd

people = ['shayna','shayna','shayna','shayna','john']
dates = ['01-01-18','01-01-18','01-01-18','01-02-18','01-02-18']
places = ['hospital', 'hospital', 'inpatient', 'hospital', 'hospital']
d = {'Person':people,'Service_Date':dates, 'Site_Where_Served':places}
df = pd.DataFrame(d)
df

Person   Service_Date   Site_Where_Served
shayna   01-01-18       hospital 
shayna   01-01-18       hospital 
shayna   01-01-18       inpatient 
shayna   01-02-18       hospital 
john     01-02-18       hospital 

What I want to do is count the unique pairs of Person and Service_Date, grouped by Site_Where_Served.

Expected output:

Site_Where_Served    Site_Visit_Count
hospital             3
inpatient            1

My attempt:

df[['Person', 'Service_Date']].groupby(df['Site_Where_Served']).nunique().reset_index(name='Site_Visit_Count')

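But the reset_index call fails here. So I tried leaving it out, and realized this isn't counting the unique pairs of 'Person' and 'Service_Date', because the output looks like this: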

                   Person    Service_Date
Site_Where_Served
hospital              2           2 
inpatient             1           1 

3 answers

In my opinion, a better approach is to drop duplicates before using groupby.size:

res = df.drop_duplicates()\
        .groupby('Site_Where_Served').size()\
        .reset_index(name='Site_Visit_Count')

print(res)

  Site_Where_Served  Site_Visit_Count
0          hospital                 3
1         inpatient                 1
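Note that drop_duplicates() with no arguments compares whole rows, which works above because the frame holds only the three relevant columns. A minimal sketch, assuming the real data carries extra columns (the Notes column below is hypothetical, added only for illustration), restricts deduplication with subset=:

```python
import pandas as pd

df = pd.DataFrame({
    'Person': ['shayna', 'shayna', 'shayna', 'shayna', 'john'],
    'Service_Date': ['01-01-18', '01-01-18', '01-01-18', '01-02-18', '01-02-18'],
    'Site_Where_Served': ['hospital', 'hospital', 'inpatient', 'hospital', 'hospital'],
    # Hypothetical extra column: every row now differs as a whole row
    'Notes': ['a', 'b', 'c', 'd', 'e'],
})

# Columns that together define one distinct visit
key_cols = ['Person', 'Service_Date', 'Site_Where_Served']

res = (df.drop_duplicates(subset=key_cols)   # ignore Notes when deduplicating
         .groupby('Site_Where_Served').size()
         .reset_index(name='Site_Visit_Count'))

print(res)
```

Without subset=, the Notes values would make all five rows distinct and hospital would be counted 4 times instead of 3.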

Maybe value_counts

(df.drop_duplicates()
   .Site_Where_Served
   .value_counts()
   .to_frame('Site_Visit_Count')
   .rename_axis('Site_Where_Served')
   .reset_index()
)

  Site_Where_Served  Site_Visit_Count
0          hospital                 3
1         inpatient                 1

drop_duplicates, groupby + count

(df.drop_duplicates()
   .groupby('Site_Where_Served')
   .Site_Where_Served.count()
   .reset_index(name='Site_Visit_Count')
)

  Site_Where_Served  Site_Visit_Count
0          hospital                 3
1         inpatient                 1

Note that one subtle difference between count and size is that the former does not count NaN entries.
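A small sketch of that difference, using hypothetical data in which one Person value is missing. The difference shows up when count is applied to a column other than the grouping key:

```python
import numpy as np
import pandas as pd

# Hypothetical data: one 'hospital' row has a missing Person
demo = pd.DataFrame({
    'Site_Where_Served': ['hospital', 'hospital', 'inpatient'],
    'Person': ['shayna', np.nan, 'john'],
})

grouped = demo.groupby('Site_Where_Served')

sizes = grouped.size()               # number of rows per group, NaN included
counts = grouped['Person'].count()   # number of non-NaN Person values per group

print(sizes.to_dict())    # {'hospital': 2, 'inpatient': 1}
print(counts.to_dict())   # {'hospital': 1, 'inpatient': 1}
```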


Tuplize, groupby, nunique

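This really just fixes your current solution, but I wouldn't recommend it, since it is long-winded and takes more steps than necessary. First, tuplize the columns, then group by Site_Where_Served, then count: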

(df[['Person', 'Service_Date']]
   .apply(tuple, axis=1)
   .groupby(df.Site_Where_Served)
   .nunique()
   .reset_index(name='Site_Visit_Count')
)

  Site_Where_Served  Site_Visit_Count
0          hospital                 3
1         inpatient                 1
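To see why the tuple step matters, here is a sketch comparing the original attempt with the tuplized version on the question's data. The variable names per_column and per_pair are only illustrative:

```python
import pandas as pd

df = pd.DataFrame({
    'Person': ['shayna', 'shayna', 'shayna', 'shayna', 'john'],
    'Service_Date': ['01-01-18', '01-01-18', '01-01-18', '01-02-18', '01-02-18'],
    'Site_Where_Served': ['hospital', 'hospital', 'inpatient', 'hospital', 'hospital'],
})

pairs = df[['Person', 'Service_Date']]

# Original attempt: nunique on a DataFrame counts each column independently
per_column = pairs.groupby(df['Site_Where_Served']).nunique()

# Tuplized: each row becomes one hashable (Person, Service_Date) value,
# so nunique counts distinct pairs
per_pair = pairs.apply(tuple, axis=1).groupby(df['Site_Where_Served']).nunique()

print(per_column)
print(per_pair)
```

For hospital, the per-column result is 2 distinct people and 2 distinct dates, while the distinct pairs number 3.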
