Filtering groups with Pandas

       playerID  yearid votedBy  ballots  needed  votes inducted category  needed_note
2860  aaronha01    1982   BBWAA      415     312    406        Y   Player          NaN
3743  abbotji01    2005   BBWAA      516     387     13        N   Player          NaN
146   adamsba01    1937   BBWAA      201     151      8        N   Player          NaN
259   adamsba01    1938   BBWAA      262     197     11        N   Player          NaN
384   adamsba01    1939   BBWAA      274     206     11        N   Player          NaN
497   adamsba01    1942   BBWAA      233     175     11        N   Player          NaN
574   adamsba01    1945   BBWAA      247     186      7        N   Player          NaN
2108  adamsbo03    1966   BBWAA      302     227      1        N   Player          NaN

3 answers

User

Answer 1 · edited 2024-10-03 00:32:04

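I modified your dataset so that it contains two such groups: one with 2 rows running from N to Y, and another with 8 rows running from N to Y. Those sizes apply if you count the row containing Y; otherwise the two groups have 1 row and 7 rows. You don't seem to have a time-series column, so I assume the rows are evenly spaced in time.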

In [25]:

import numpy as np
import pandas as pd

# Read the table copied from the question
df = pd.read_clipboard()
print(df)
       playerID  yearid votedBy  ballots  needed  votes inducted category  needed_note 
3741  abbotji01    2005   BBWAA      516     387     13        N   Player          NaN 
2860  aaronha01    1982   BBWAA      415     312    406        Y   Player          NaN 
3743  abbotji01    2005   BBWAA      516     387     13        N   Player          NaN 
146   adamsba01    1937   BBWAA      201     151      8        N   Player          NaN 
259   adamsba01    1938   BBWAA      262     197     11        N   Player          NaN 
384   adamsba01    1939   BBWAA      274     206     11        N   Player          NaN 
497   adamsba01    1942   BBWAA      233     175     11        N   Player          NaN 
574   adamsba01    1945   BBWAA      247     186      7        N   Player          NaN 
2108  adamsbo03    1966   BBWAA      302     227      1        N   Player          NaN 
2861  aaronha01    1982   BBWAA      415     312    406        Y   Player          NaN 

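In [26]:

# Flag the rows where the player was inducted
df['isY'] = (df.inducted == 'Y')
# Shift the running count of Y rows down by one row, so each Y row closes its own group
df['isY'] = np.hstack((0, df['isY'].cumsum().values[:-1]))

In [27]:

print(df.groupby('isY').count())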
     playerID  yearid  votedBy  ballots  needed  votes  inducted  category  needed_note  isY 
0           2       2        2        2       2      2         2         2            0    2 
1           8       8        8        8       8      8         8         8            0    8 
[2 rows x 10 columns]

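Assuming the Y row itself is not counted, the mean is the average of the group sizes after subtracting one from each, which here is (1 + 7) / 2 = 4.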
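A minimal sketch of one way to compute that mean, not necessarily the code the answer originally used. It rebuilds only the `inducted` column of the modified table (the other columns are not needed for the count), and the names `group_id`, `sizes` and `mean_without_y` are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Only the inducted column matters here; values copied from the modified table.
df = pd.DataFrame({'inducted': ['N', 'Y', 'N', 'N', 'N', 'N', 'N', 'N', 'N', 'Y']})

# Same grouping key as in the answer: running count of Y rows, shifted down one row.
is_y = (df['inducted'] == 'Y')
group_id = np.hstack((0, is_y.cumsum().values[:-1]))

# Number of rows in each N-to-Y group, including the closing Y row.
sizes = df.groupby(group_id).size()

# Drop the Y row from each group before averaging.
mean_without_y = (sizes - 1).mean()
print(mean_without_y)
```

Subtracting one from every size assumes each group ends in a Y row, which holds for this table but would need checking on the full dataset.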

User

Answer 2 · edited 2024-10-03 00:32:04

The filter method of the DataFrameGroupBy class operates on the sub-frame of each group. See help(pd.core.groupby.DataFrameGroupBy.filter). The data looks like this:

print(df)
  inducted playerID
0        Y        a
1        N        a
2        N        a
3        Y        b
4        N        b
5        N        c
6        N        c
7        N        c

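The function passed to filter is called once per group with that group's sub-frame, and returns True to keep the group or False to drop it.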
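A minimal sketch of how filter might be applied to the data above, not necessarily with the condition the answer had in mind. It keeps only the rows of players who have at least one `inducted == 'Y'` row; that predicate and the name `inducted_players` are assumptions for illustration.

```python
import pandas as pd

# The small example frame shown above.
df = pd.DataFrame({
    'inducted': ['Y', 'N', 'N', 'Y', 'N', 'N', 'N', 'N'],
    'playerID': ['a', 'a', 'a', 'b', 'b', 'c', 'c', 'c'],
})

# filter calls the function once per group and keeps the groups
# for which it returns True; rows of all other groups are dropped.
inducted_players = df.groupby('playerID').filter(
    lambda group: (group['inducted'] == 'Y').any()
)
print(inducted_players)
```

With this data, players a and b are kept and player c, who has no Y row, is dropped.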

User

Answer 3 · edited 2024-10-03 00:32:04

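I mocked up my own data to run a simple test for your question. I create a frame called df_inducted that holds the players who were eventually inducted, and by using the isin() function we can make sure that only those players are considered in the analysis. Then I find the minimum and maximum of their dates and average the differences.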

import pandas as pd

df = pd.DataFrame({'player':['Nate','Will','Nate','Will'], 
                   'inducted': ['Y','Y','N','N'],
                   'date':[2014,2000,2011,1999]})

df_inducted = df[df.inducted=='Y']
df_subset = df[df.player.isin(df_inducted.player)]

maxs = df_subset.groupby('player')['date'].max()
mins = df_subset.groupby('player')['date'].min()

maxs = pd.DataFrame(maxs)
maxs.columns = ['max_date']
mins = pd.DataFrame(mins)
mins.columns = ['min_date']

min_and_max = maxs.join(mins)
final = min_and_max['max_date'] - min_and_max['min_date']

print("average time:", final.mean())
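The same calculation can be sketched more compactly by letting agg compute both extremes in one groupby pass. This is an alternative formulation on the same mock data, not the answer's own code; the names `inducted_names`, `span` and `average_time` are assumptions.

```python
import pandas as pd

df = pd.DataFrame({'player': ['Nate', 'Will', 'Nate', 'Will'],
                   'inducted': ['Y', 'Y', 'N', 'N'],
                   'date': [2014, 2000, 2011, 1999]})

# Keep every row belonging to a player who has at least one 'Y' row.
inducted_names = df.loc[df['inducted'] == 'Y', 'player']
df_subset = df[df['player'].isin(inducted_names)]

# One groupby pass yields the earliest and latest date per player.
span = df_subset.groupby('player')['date'].agg(['min', 'max'])

# Average the per-player difference between latest and earliest date.
average_time = (span['max'] - span['min']).mean()
print("average time:", average_time)
```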
