Get the maximum count of DataFrame rows with respect to two grouping columns

Posted 2024-09-30 10:33:04


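I have a DataFrame (df) and want to get the maximum count of "NCT_ID" with respect to the columns "COUNTRY" and "CONDITION", counting every occurrence rather than only unique values. That is, for each country in "COUNTRY" I want the n most common conditions in "CONDITION" (n=2 for simplicity), sorted by count in descending order. The df has the following structure (every column takes many different values, including "COUNTRY"; this is only a small sample):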

    NCT_ID      CONDITION                   COUNTRY
0   NCT00000261 Substance-Related Disorders United States
1   NCT00000262 Opioid-Related Disorders    United States
2   NCT00000263 Substance-Related Disorders United States
3   NCT00000263 Substance-Related Disorders United States
4   NCT00000264 Heart disease               Canada
5   NCT00000264 Heart disease               Canada
6   NCT00000267 Heart disease               Canada
7   NCT00000264 Cancer                      Canada
8   NCT00000268 Cancer                      Canada

You can load it as follows:

import pandas as pd

df = pd.DataFrame([["NCT00000261", "Substance-Related Disorders", "United States"],
                   ["NCT00000262", "Opioid-Related Disorders", "United States"],
                   ["NCT00000263", "Substance-Related Disorders", "United States"],
                   ["NCT00000263", "Substance-Related Disorders", "United States"],
                   ["NCT00000264", "Heart disease", "Canada"],
                   ["NCT00000264", "Heart disease", "Canada"],
                   ["NCT00000267", "Heart disease", "Canada"],
                   ["NCT00000264", "Cancer", "Canada"],
                   ["NCT00000268", "Cancer", "Canada"]
                  ],
                  columns=["NCT_ID", "CONDITION", "COUNTRY"]
                 )

So I would like the final result to look like this:

    COUNTS  CONDITION                   COUNTRY
0   3       Substance-Related Disorders United States
0   1       Opioid-Related Disorders    United States
1   3       Heart disease               Canada
1   2       Cancer                      Canada

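The final df should show the n most common conditions in the n countries with the largest total count (total number of condition rows). What I have done so far: following https://stackoverflow.com/a/17679517/7445528, I tried: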

# df_combined = df_combined.groupby(['COUNTRY', 'CONDITION']).size()
# df_combined = df_combined.groupby(['COUNTRY', 'CONDITION']).size().groupby(level=0).max()
# df_combined = df_combined.groupby(['COUNTRY', 'CONDITION']).size().reset_index().groupby('COUNTRY')[[0]].max()

But none of these gave the correct DataFrame. To see the whole project so far, go to: https://github.com/Gustav-Rasmussen/AACT-Analysis/tree/master


2 Answers

Try this:

# Count the NCT_ID entries per (CONDITION, COUNTRY) pair, largest count first.
df.groupby(['CONDITION', 'COUNTRY']).count().rename(columns={'NCT_ID': 'COUNT'}).reset_index().sort_values(by='COUNT', ascending=False)

# size() counts the rows in each group, the same result as apply(len).
new_df = df.groupby(['CONDITION', 'COUNTRY']).size().reset_index(name='COUNTS')

new_df.sort_values(by='COUNTS', ascending=False, inplace=True)
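
The snippets above count rows per group but do not limit the result to n conditions per country. The following is a minimal sketch of one way to do that, not a definitive solution. It assumes that "the n countries with the largest total count" means ranking countries by their total number of rows; the helper column TOTAL and the names counts, top_countries and result are introduced here purely for illustration. Under that assumption Canada (5 rows) is listed before United States (4 rows), which differs from the row order shown in the question.

```python
import pandas as pd

df = pd.DataFrame(
    [
        ["NCT00000261", "Substance-Related Disorders", "United States"],
        ["NCT00000262", "Opioid-Related Disorders", "United States"],
        ["NCT00000263", "Substance-Related Disorders", "United States"],
        ["NCT00000263", "Substance-Related Disorders", "United States"],
        ["NCT00000264", "Heart disease", "Canada"],
        ["NCT00000264", "Heart disease", "Canada"],
        ["NCT00000267", "Heart disease", "Canada"],
        ["NCT00000264", "Cancer", "Canada"],
        ["NCT00000268", "Cancer", "Canada"],
    ],
    columns=["NCT_ID", "CONDITION", "COUNTRY"],
)

n = 2  # conditions kept per country, and number of countries kept

# Count every row (duplicates included) per (COUNTRY, CONDITION) pair.
counts = df.groupby(["COUNTRY", "CONDITION"]).size().reset_index(name="COUNTS")

# Assumption: countries are ranked by their total number of rows.
counts["TOTAL"] = counts.groupby("COUNTRY")["COUNTS"].transform("sum")

# Keep only the n countries with the largest totals.
top_countries = counts.groupby("COUNTRY")["COUNTS"].sum().nlargest(n).index
counts = counts[counts["COUNTRY"].isin(top_countries)]

# Order countries by total, conditions by count, then keep n rows per country.
result = (
    counts.sort_values(
        ["TOTAL", "COUNTRY", "COUNTS"], ascending=[False, True, False]
    )
    .groupby("COUNTRY", sort=False)
    .head(n)
    .drop(columns="TOTAL")
    .reset_index(drop=True)
)

print(result)
```

If the countries should instead keep the order in which they first appear in df, the TOTAL sort key would need to be replaced by a key derived from that order.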
