Creating columns from a where clause and getting averages from categorical values

Published 2024-10-06 15:16:31


Here are the first five rows of my DataFrame:

City    Edition NOC Medal
Athens  1896    HUN Gold
Athens  1896    AUT Silver
Athens  1896    GRE Bronze
Athens  1896    GRE Gold
Athens  1896    GRE Silver

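I want to create a new table grouped by NOC, with two further columns, Average Before 1996 and Average After 1996, split according to the Edition column. It would look like this (all values are placeholders):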

NOC Average Before 1996  Average After 1996
USA     30              40
URS     25              30
GBR     50              20

Here is where I am stuck. I can create a total medal count for each country:

total_medal_count = olympics_df.groupby('NOC')\
                               .Medal.count()\
                               .reset_index(name="Medal_Count")\
                               .sort_values("Medal_Count", ascending=False)

NOC Medal_Count
USA 4334
URS 2049
GBR 1594

However, I cannot get the average for specific values of the Edition column.

I tried the following:

total_medal_count['Before 1996'] = np.mean(total_medal_count.Medal_Count).where(olympics_df['Edition'] < 1996)

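But this does not work, because I cannot call where on the result of the mean. I would probably also run into trouble because np.mean refers to one DataFrame while where refers to another.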
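A minimal sketch of why the attempt above fails, using a small made-up stand-in for olympics_df (the rows are illustrative, not real results). np.mean reduces the whole Medal_Count column to a single number, and a plain number has no where method; filtering the rows by Edition before aggregating avoids the problem.

```python
import numpy as np
import pandas as pd

# Toy stand-in for olympics_df (assumed values, for illustration only)
olympics_df = pd.DataFrame({
    "Edition": [1992, 1992, 1996, 2000, 2000, 2004],
    "NOC":     ["USA", "USA", "USA", "USA", "GBR", "GBR"],
    "Medal":   ["Gold", "Silver", "Gold", "Bronze", "Gold", "Silver"],
})

total_medal_count = (
    olympics_df.groupby("NOC").Medal.count().reset_index(name="Medal_Count")
)

# np.mean collapses the whole column to a single number ...
overall_mean = np.mean(total_medal_count.Medal_Count)
# ... and a single number has no .where method to chain onto
has_where = hasattr(overall_mean, "where")
print(overall_mean, has_where)

# Filtering the rows first keeps the mask and the data on the same index
before_1996 = olympics_df[olympics_df["Edition"] < 1996]
count_before = before_1996.groupby("NOC").Medal.count()
print(count_before)
```

The mask olympics_df['Edition'] < 1996 has one entry per medal row, so it can only be applied to olympics_df itself, not to the per-NOC summary table.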


2 answers

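You can use the gt method to split the DataFrame at the year you need. Create a new DataFrame indexed by all the unique NOC values, so that any NOC missing from one of the splits in the next step is still accounted for. Use groupby on each part of the split DataFrame. Then use apply with a function that counts the rows for each unique value of Edition (value_counts) and averages those counts for each NOC.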

Input sample.csv

  City  Edition  NOC   Medal
Athens     1993  GRE    Gold
Athens     1994  AUT  Silver
Athens     1994  GRE  Bronze
Athens     1994  GRE    Gold
Athens     1994  GRE  Silver
Athens     1997  GRE  Silver
Athens     1998  HUN    Gold
Athens     1998  AUT  Silver
Athens     1998  GRE  Bronze
Athens     1998  HUN    Gold
Athens     1998  AUT  Silver
Athens     1998  GRE  Bronze
Athens     2001  GRE    Gold
Athens     2002  GRE  Silver
Athens     2003  HUN    Gold
import pandas as pd

# Raw string so that \s is passed to the regex engine unchanged
df = pd.read_csv('sample.csv', sep=r'\s+')

gt1996 = df['Edition'].gt(1996)
le1996 = ~gt1996

avg_medals = lambda x: x['Edition'].value_counts().mean()

dr = pd.DataFrame(index=df['NOC'].unique())
dr['Average Before 1996'] = df[le1996].groupby('NOC').apply(avg_medals)
dr['Average After 1996'] = df[gt1996].groupby('NOC').apply(avg_medals)

print(dr)

dr

     Average Before 1996  Average After 1996
GRE                  2.0                1.25
AUT                  1.0                2.00
HUN                  NaN                1.50
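As a possible alternative, the same kind of table can be sketched in a single pass by grouping on NOC, a period label, and Edition together, then averaging the per-edition counts and unstacking. This is only a sketch on a small made-up sample in the same shape as sample.csv; the names period, per_edition, and result are illustrative and not part of the answer above.

```python
import pandas as pd

# Small made-up sample in the same shape as sample.csv
df = pd.DataFrame({
    "Edition": [1993, 1994, 1994, 1994, 1997, 1998, 1998, 1998],
    "NOC":     ["GRE", "GRE", "GRE", "AUT", "GRE", "GRE", "HUN", "HUN"],
    "Medal":   ["Gold", "Bronze", "Gold", "Silver",
                "Silver", "Bronze", "Gold", "Gold"],
})

columns = ["Average Before 1996", "Average After 1996"]

# Label every row with the period it falls in (same split as gt(1996))
period = df["Edition"].gt(1996).map({False: columns[0], True: columns[1]})

# Medals per (NOC, period, Edition)
per_edition = df.groupby(["NOC", period.rename("Period"), "Edition"]).size()

# Mean over editions, then one column per period
result = (
    per_edition.groupby(level=["NOC", "Period"]).mean()
    .unstack("Period")
    .reindex(columns=columns)
)
print(result)
```

As with value_counts, the average is taken only over editions in which the NOC won at least one medal, and an NOC with no medals in a period gets NaN.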

Is this what you need?

Note: I added a row (the second row) for testing.

import pandas as pd

data = [
    ['Athens', 1896, 'HUN', 'Gold'],
    ['Athens', 1000, 'HUN', 'Gold'],
    ['Athens', 1997, 'HUN', 'Gold'],
    ['Athens', 1896, 'AUT', 'Silver'],
    ['Athens', 1896, 'GRE', 'Bronze'],
    ['Athens', 1896, 'GRE', 'Gold'],
    ['Athens', 1896, 'GRE', 'Silver'],
]

# Create the DataFrame
df = pd.DataFrame(data, columns=['City', 'Edition', 'NOC', 'Medal'])

aux_df = df.groupby(by='NOC')['Edition'].mean().reset_index()

aux_df['Average Before 1996'] = aux_df['NOC'].apply(lambda x: df[(df.NOC == x) & (df.Edition<=1996)].groupby(by ='NOC')['Edition'].mean().reset_index()['Edition'].sum())
aux_df['Average After 1996'] = aux_df['NOC'].apply(lambda x: df[(df.NOC == x) & (df.Edition>1996)].groupby(by ='NOC')['Edition'].mean().reset_index()['Edition'].sum())

aux_df['Count Before 1996'] = aux_df['NOC'].apply(lambda x: df[(df.NOC == x) & (df.Edition<=1996)].groupby(by ='NOC')['Medal'].count().reset_index()['Medal'].sum())
aux_df['Count After 1996'] = aux_df['NOC'].apply(lambda x: df[(df.NOC == x) & (df.Edition>1996)].groupby(by ='NOC')['Medal'].count().reset_index()['Medal'].sum())


#print(df.groupby(by ='NOC').mean())

print(aux_df.to_string())
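The two count columns above can probably be produced without filtering the DataFrame once per NOC. Here is a hedged sketch using pd.crosstab on the same test data; the names after and counts are illustrative.

```python
import pandas as pd

df = pd.DataFrame(
    [
        ["Athens", 1896, "HUN", "Gold"],
        ["Athens", 1000, "HUN", "Gold"],
        ["Athens", 1997, "HUN", "Gold"],
        ["Athens", 1896, "AUT", "Silver"],
        ["Athens", 1896, "GRE", "Bronze"],
        ["Athens", 1896, "GRE", "Gold"],
        ["Athens", 1896, "GRE", "Silver"],
    ],
    columns=["City", "Edition", "NOC", "Medal"],
)

# True for rows after 1996, False for rows up to and including 1996
after = df["Edition"] > 1996

# Cross-tabulate NOC against the boolean mask to count rows in each period
counts = (
    pd.crosstab(df["NOC"], after)
    .reindex(columns=[False, True], fill_value=0)  # keep both columns even if one is empty
    .rename(columns={False: "Count Before 1996", True: "Count After 1996"})
)
counts.columns.name = None
print(counts)
```

An NOC with no rows in a period gets 0 here, which matches the sum over an empty selection in the per-NOC version.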
