Pandas: group by using only positive values

import numpy as np
import pandas as pd

flights = pd.read_csv('https://github.com/bhishanpdl/datasets/blob/master/nycflights13.csv?raw=true')
print(flights.shape)
print(flights.iloc[:2,:4])
print()

not_cancelled = flights.dropna(subset=['dep_delay','arr_delay'])
df = (not_cancelled.groupby(['year','month','day'])['arr_delay']
      .mean().reset_index()
     )
df['avg_delay2'] = df[df.arr_delay>0]['arr_delay'].mean()
print(df.head())

library(nycflights13)
library(dplyr)  # provides %>%, filter, group_by, summarize

not_cancelled = flights %>%
    filter( !is.na(dep_delay), !is.na(arr_delay))

df = not_cancelled %>%
    group_by(year,month,day) %>%
    summarize(
        # average delay
        avg_delay1 = mean(arr_delay),
        # average positive delay
        avg_delay2 = mean(arr_delay[arr_delay>0]))

head(df)

2 Answers

User

#1 · edited 2024-09-29 01:34:27

I would filter for the positive values before the groupby:

df = (not_cancelled[not_cancelled.arr_delay >0].groupby(['year','month','day'])['arr_delay']
      .mean().reset_index()
     )
df.head()

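This is because, in your code, df is a separate DataFrame that exists only after the groupby has finished, and

df['avg_delay2'] = df[df.arr_delay>0]['arr_delay'].mean()

assigns the same single value to every row of df['avg_delay2'].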
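To make the point concrete, here is a minimal sketch on toy data (the three-row frame below is hypothetical and only stands in for the grouped daily means). The right-hand side collapses to one scalar, and pandas broadcasts that scalar down the whole column:

```python
import pandas as pd

# Hypothetical toy frame standing in for the per-day means from the groupby.
df = pd.DataFrame({"day": [1, 2, 3], "arr_delay": [10.0, -4.0, 20.0]})

# This is a single number: the mean of the positive daily means.
overall = df[df.arr_delay > 0]["arr_delay"].mean()

# Assigning a scalar to a column repeats it on every row.
df["avg_delay2"] = overall
print(df)
```

Every row ends up with the same value, which is not the per-day positive mean that the R code computes.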

Edit: similar to R, you can do both operations in one go using agg:

def mean_pos(x):
    return x[x>0].mean()

df = (not_cancelled.groupby(['year','month','day'])['arr_delay']
      .agg({'arr_delay': 'mean', 'arr_delay_2': mean_pos})
     )
df.head()
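On newer pandas versions the dict-of-names form on a Series groupby is no longer accepted. A possible alternative, assuming pandas 0.25 or later, is named aggregation with keyword arguments. The sketch below uses a small hypothetical frame (toy) instead of the flights data, and the output column names avg_delay1 and avg_delay2 are borrowed from the R code in the question:

```python
import pandas as pd

# Hypothetical toy data with the same column names as the flights table.
toy = pd.DataFrame({
    "year": [2013] * 4,
    "month": [1] * 4,
    "day": [1, 1, 2, 2],
    "arr_delay": [10.0, -20.0, 30.0, 50.0],
})

def mean_pos(x):
    """Mean of the strictly positive values in a Series."""
    return x[x > 0].mean()

# Named aggregation: each keyword becomes an output column.
out = (toy.groupby(["year", "month", "day"])["arr_delay"]
          .agg(avg_delay1="mean", avg_delay2=mean_pos)
          .reset_index())
print(out)
```

This keeps both aggregates in a single groupby pass, much like the summarize call in R.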

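User

#2 · edited 2024-09-29 01:34:27

Note that in pandas 0.23, passing a dictionary to agg on a Series groupby is deprecated and will be removed in a future version, so we should not rely on that method.

Warning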

df = (not_cancelled.groupby(['year','month','day'])['arr_delay']
      .agg({'arr_delay': 'mean', 'arr_delay_2': mean_pos})
     )

FutureWarning: using a dict on a Series for aggregation
is deprecated and will be removed in a future version.

So, to work around this, I came up with another idea.

Create a new column in which all non-positive values are set to NaN, then do a regular groupby.

^{pr2}$
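The snippet for this step is not preserved above, so the following is only a sketch of the described idea and not necessarily the answerer's exact code. It assumes a small hypothetical frame (toy) in place of the flights data and uses Series.where to turn non-positive delays into NaN; the column name arr_delay_positive is taken from the output shown in this answer:

```python
import pandas as pd

# Hypothetical toy data with the same column names as the flights table.
toy = pd.DataFrame({
    "year": [2013] * 4,
    "month": [1] * 4,
    "day": [1, 1, 2, 2],
    "arr_delay": [10.0, -20.0, 30.0, 50.0],
})

# Keep positive delays, replace everything else with NaN.
toy["arr_delay_positive"] = toy["arr_delay"].where(toy["arr_delay"] > 0)

# mean() skips NaN by default, so the second column averages positives only.
out = (toy.groupby(["year", "month", "day"])[["arr_delay", "arr_delay_positive"]]
          .mean()
          .reset_index())
print(out)
```

Because the masking happens before the groupby, no custom aggregation function or dictionary is needed.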

This gives:

   year  month  day  arr_delay  arr_delay_positive
0  2013      1    1  12.651023           32.481562
1  2013      1    2  12.692888           32.029907
2  2013      1    3   5.733333           27.660870
3  2013      1    4  -1.932819           28.309764
4  2013      1    5  -1.525802           22.558824

Sanity check

# sanity check
a = not_cancelled.query(""" year==2013 & month ==1 & day ==1 """)['arr_delay']
a = a[a>0]
a.mean() # 32.48156182212581
