Python Pandas: feature generation as an aggregate function

1 answer

User

#1 · Posted 2024-09-30 22:15:17

On a toy example DataFrame, using apply() instead of iterrows() gives roughly a 7x speedup.

Here is some sample data, extended from the OP's to include repeated key values:

    ID  key dist
0    1   57  1
1    2   22  1
2    3   12  1
3    4   45  1
4    5   94  1
5    6   36  1
6    7   38  1
7    8   94  1
8    9   94  1
9   10   38  1

import pandas as pd
df = pd.read_clipboard()
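
pd.read_clipboard() only works if the table above has just been copied to the clipboard. As a clipboard-free alternative, here is a minimal sketch that builds the same frame directly from the values shown above:

```python
import pandas as pd

# Same ten rows as the sample table; the default RangeIndex 0..9
# matches the leftmost (unnamed) column of the printed table.
df = pd.DataFrame({
    'ID': list(range(1, 11)),
    'key': [57, 22, 12, 45, 94, 36, 38, 94, 94, 38],
    'dist': [1] * 10,
})
print(df)
```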

Given this data and the counting criterion the OP defined, we expect the output to be:

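   ID  key  dist  window1
0   1   57     1        0
1   2   22     1        0
2   3   12     1        0
3   4   45     1        0
4   5   94     1        0
5   6   36     1        0
6   7   38     1        0
7   8   94     1        1
8   9   94     1        2
9  10   38     1        1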

Using the OP's approach:

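def features_wind2(inp):
    # note: this aliases inp, so the caller's DataFrame gains a 'window1' column
    all_window = inp
    all_window['window1'] = 0
    for index, row in all_window.iterrows():
        lid = index
        lid1 = lid - 200  # lower bound (exclusive) of the look-back window
        pid = row['key']
        # count earlier rows inside the window that share this row's key
        n_prior = len(all_window.query('index < %d & index > %d & key == %d' % (lid, lid1, pid)))
        # iterrows() yields copies, so write through .loc rather than row[...]
        all_window.loc[index, 'window1'] = n_prior
    return all_window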

print('old solution: ')
%timeit features_wind2(df) 

old solution: 
10 loops, best of 3: 25.6 ms per loop

Using apply():

def compute_window(row):
    # when using apply(), .name gives the row index
    # .loc slicing includes the end label, so take index-1 as cut_idx
    cut_idx = row.name - 1
    key = row['key']
    # count the number of times key appears in df prior to this row
    # (.ix was removed in pandas 1.0; .loc is the label-based replacement)
    return (df.loc[:cut_idx, 'key'] == key).sum()

print('new solution: ')
%timeit df['window1'] = df.apply(compute_window, axis='columns')

new solution: 
100 loops, best of 3: 3.71 ms per loop
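
As a sanity check, the apply() idea can be run on the toy data without relying on a global df. This is only a sketch: the helper name count_prior and the closure layout are illustrative assumptions, not part of the original answer.

```python
import pandas as pd

df = pd.DataFrame({
    'ID': list(range(1, 11)),
    'key': [57, 22, 12, 45, 94, 36, 38, 94, 94, 38],
    'dist': [1] * 10,
})

def count_prior(frame):
    """For each row, count the earlier rows that share its key."""
    def _one(row):
        # .loc slices by label and includes the end label, hence row.name - 1
        earlier = frame.loc[:row.name - 1, 'key']
        return int((earlier == row['key']).sum())
    return frame.apply(_one, axis='columns')

window1 = count_prior(df)
print(window1.tolist())  # [0, 0, 0, 0, 0, 0, 0, 1, 2, 1]
```

This assumes the default integer index 0..N-1, as in the sample data.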

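Note that for millions of records this will still take a while, and the relative speedup may be smaller than on this small test case.

Update
Here is a faster solution using groupby() and cumsum(). I generated some sample data that seems roughly consistent with the example provided, but with 10 million rows. The computation finishes in under a second on average: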

# sample data
import numpy as np
import pandas as pd

N = int(1e7)
idx = np.arange(N)
keys = np.random.randint(1,100,size=N)
dists = np.ones(N).astype(int)
df = pd.DataFrame({'ID':idx,'key':keys,'dist':dists})
df = df.set_index('ID')

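Now the performance test:

# select 'dist' explicitly so repeated runs do not also cumsum the new 'window' column
%timeit df['window'] = df.groupby('key')['dist'].cumsum().subtract(1)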

1 loop, best of 3: 755 ms per loop

Here is enough of the output to show that the computation works:

    dist  key  window
ID                   
0      1   83       0
1      1    4       0
2      1   87       0
3      1   66       0
4      1   31       0
5      1   33       0
6      1    1       0
7      1   77       0
8      1   49       0
9      1   49       1
10     1   97       0
11     1   36       0
12     1   19       0
13     1   75       0
14     1    4       1
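
Because dist is 1 on every row, the cumulative sum minus one is simply a running count of earlier rows with the same key. pandas also offers groupby().cumcount() for exactly that count. The sketch below checks that the two agree on a smaller sample; the seed and sample size are assumptions chosen only to keep the example quick.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seeded only for reproducibility
N = 1000
df = pd.DataFrame({
    'key': rng.integers(1, 100, size=N),
    'dist': np.ones(N, dtype=int),
})
df.index.name = 'ID'

# running total of dist within each key, shifted to start at 0
via_cumsum = df.groupby('key')['dist'].cumsum() - 1
# 0-based position of each row within its key group
via_cumcount = df.groupby('key').cumcount()

print((via_cumsum == via_cumcount).all())
```

cumcount() does not depend on the dist column at all, so it may be the safer choice if dist can take values other than 1.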

Note: to turn ID from the index back into a column, use df.reset_index() at the end.
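
One caveat: the OP's query only counts matches whose index lies strictly between index - 200 and index, while the groupby() solution counts every earlier match. If the look-back limit matters, something along these lines might work. It is a sketch only: the function name windowed_prior_count is hypothetical, and it assumes rows are numbered 0..N-1 in order.

```python
import numpy as np
import pandas as pd

def windowed_prior_count(keys, window=200):
    """For each position i, count positions j with i - window < j < i
    and keys[j] == keys[i]."""
    keys = np.asarray(keys)
    pos = pd.Series(np.arange(len(keys)))
    out = np.zeros(len(keys), dtype=np.int64)
    # groupby keeps the original order inside each group, so p is sorted
    for _, p in pos.groupby(keys):
        p = p.to_numpy()
        # number of earlier rows with this key
        rank = np.arange(len(p))
        # earlier rows with this key at or before position i - window
        too_old = np.searchsorted(p, p - window, side='right')
        out[p] = rank - too_old
    return out

toy_keys = [57, 22, 12, 45, 94, 36, 38, 94, 94, 38]
print(windowed_prior_count(toy_keys).tolist())  # [0, 0, 0, 0, 0, 0, 0, 1, 2, 1]
```

On the toy data the 200-row window never cuts anything off, so the result matches the unwindowed count. The loop runs once per distinct key rather than once per row, which keeps it cheap when there are few keys.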
