Pandas: the apply function I'm using gives me the wrong result

Posted 2024-07-03 07:02:31


I have a dataset like this:

     a_id b_received brand_id c_consumed type_received       date  output  \
0    sam       soap     bill        oil       edibles 2011-01-01       1   
1    sam        oil    chris        NaN       utility 2011-01-02       1   
2    sam      brush      dan       soap       grocery 2011-01-03       0   
3  harry        oil      sam      shoes      clothing 2011-01-04       1   
4  harry      shoes     bill        oil       edibles 2011-01-05       1   
5  alice       beer      sam       eggs     breakfast 2011-01-06       0   
6  alice      brush    chris      brush      cleaning 2011-01-07       1   
7  alice       eggs      NaN        NaN       edibles 2011-01-08       1   

I am using the following code:

def probability(x):
    y = []
    for i in range(len(x)):
        y.append(float(x[i]) / float(len(x)))
    return y

df2['prob'] = (df2.groupby('a_id')
               .apply(probability(['output']))
               .reset_index(level='a_id', drop=True))

The ideal result would be a new column with the following values:

       prob
    0  0.333334
    1  0.333334
    2  0.0
    3  0.5
    4  0.5
    5  0
    6  0.333334
    7  0.333334

But I get this error:

y.append(float(x[i])/float(len(x)))
ValueError: could not convert string to float: output

The output column is of int type, so I don't understand why I'm getting this error.

I'm trying to calculate the probability of products consumed by each person, and that probability is derived from the column output. For example, if sam received soap and soap also appears in the column c_consumed, the result is 1; otherwise it is 0.

Now, since sam received 3 products and consumed 2 of them, the probability for each consumed product is 1/3. So the probability should be 0.333334 wherever output is 1, and 0 wherever output is 0.
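As an aside, the output flag described above can be derived directly. A minimal sketch of that logic, reconstructed from the description (this is an assumed reconstruction, not code from the question; only sam's three rows are shown):

```python
import pandas as pd

# Assumed reconstruction: output is 1 when the received product
# appears anywhere in that same person's c_consumed column.
df = pd.DataFrame({
    'a_id':       ['sam', 'sam', 'sam'],
    'b_received': ['soap', 'oil', 'brush'],
    'c_consumed': ['oil', None, 'soap'],
})

# Build the set of consumed products per person.
consumed = df.groupby('a_id')['c_consumed'].apply(lambda s: set(s.dropna()))

# Flag each received product that the same person also consumed.
df['output'] = [int(b in consumed[a])
                for a, b in zip(df['a_id'], df['b_received'])]
print(df['output'].tolist())  # [1, 1, 0]
```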

How can I achieve the desired result?


1 Answer

I think you can simply pass the output column to the GroupBy object — .groupby('a_id')['output'] — and then apply the function probability, which just divides the column by its len. (Your original code fails because probability(['output']) is called immediately with the literal list ['output'], so float('output') raises the ValueError; apply needs the function itself, not the value it returns.)

def probability(x):
    # print(x)
    return x / len(x)

df2['prob']= (df2.groupby('a_id')['output']
           .apply(probability)
           .reset_index(level='a_id', drop=True))

Or using a lambda:

df2['prob']= (df2.groupby('a_id')['output']
           .apply(lambda x: x / len(x) )
           .reset_index(level='a_id', drop=True))

A simpler and faster solution is possible with transform:

df2['prob']= df2['output'] / df2.groupby('a_id')['output'].transform('count')
print(df2)
    a_id b_received brand_id c_consumed type_received        date  output  \
0    sam       soap     bill        oil       edibles  2011-01-01       1   
1    sam        oil    chris        NaN       utility  2011-01-02       1   
2    sam      brush      dan       soap       grocery  2011-01-03       0   
3  harry        oil      sam      shoes      clothing  2011-01-04       1   
4  harry      shoes     bill        oil       edibles  2011-01-05       1   
5  alice       beer      sam       eggs     breakfast  2011-01-06       0   
6  alice      brush    chris      brush      cleaning  2011-01-07       1   
7  alice       eggs      NaN        NaN       edibles  2011-01-08       1   

       prob  
0  0.333333  
1  0.333333  
2  0.000000  
3  0.500000  
4  0.500000  
5  0.000000  
6  0.333333  
7  0.333333  
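The transform approach above can be put together as a self-contained snippet (here only a_id and output are reconstructed from the question's frame; the other columns don't affect the result):

```python
import pandas as pd

# Reconstruct the relevant columns from the question's sample data.
df2 = pd.DataFrame({
    'a_id':   ['sam', 'sam', 'sam', 'harry', 'harry',
               'alice', 'alice', 'alice'],
    'output': [1, 1, 0, 1, 1, 0, 1, 1],
})

# Divide each row's output by the size of its a_id group.
df2['prob'] = df2['output'] / df2.groupby('a_id')['output'].transform('count')
print(df2['prob'].round(6).tolist())
# [0.333333, 0.333333, 0.0, 0.5, 0.5, 0.0, 0.333333, 0.333333]
```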

Timings:

In [505]: %timeit (df2.groupby('a_id')['output'].apply(lambda x: x / len(x) ).reset_index(level='a_id', drop=True))
The slowest run took 10.99 times longer than the fastest. This could mean that an intermediate result is being cached 
100 loops, best of 3: 1.73 ms per loop

In [506]: %timeit df2['output'] / df2.groupby('a_id')['output'].transform('count')
The slowest run took 5.03 times longer than the fastest. This could mean that an intermediate result is being cached 
1000 loops, best of 3: 449 µs per loop
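One caveat worth noting: 'count' ignores NaN while 'size' counts every row, so the two give different denominators if output can contain NaN (it can't in this question's data). A quick check illustrating the difference:

```python
import pandas as pd
import numpy as np

# One NaN in output: 'count' sees 2 values, 'size' sees 3 rows.
s = pd.DataFrame({'a_id':   ['sam', 'sam', 'sam'],
                  'output': [1.0, np.nan, 0.0]})

print(s.groupby('a_id')['output'].transform('count').tolist())  # [2, 2, 2]
print(s.groupby('a_id')['output'].transform('size').tolist())   # [3, 3, 3]
```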
