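<p>@Kristof's answer is a good starting point. I found that the suggestion gave a speedup of a little under 2x. For large DataFrames, it is also worth keeping in mind the order in which the expressions are applied (for example, whether you need to create a new DataFrame and then select a Series from it, or whether you can produce the new Series directly). When you don't need the rich pandas methods, you can also work with NumPy types directly.</p>
<p>Extending your example:</p>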
<pre><code>In [58]: df_big = pd.DataFrame()
In [59]: df_big = pd.concat([df] * 1000)  # DataFrame.append was removed in pandas 2.0
In [61]: len(df_big)
Out[61]: 10000
In [62]: dfr = df_big.to_records()
In [63]: dfr
Out[63]:
rec.array([(0, 'A1', 'BA1', 'CA1', 'D1', 900), (1, 'A2', 'BA2', 'CA2', 'D2', 900),
(2, 'A3', 'BA3', 'CA3', 'D3', 500), ...,
(7, 'A1', 'BA1', 'CA1', 'D1', 700), (8, 'A4', 'BA4', 'CA4', 'D4', 300),
(9, 'A4', 'BA4', 'CA4', 'D4', 500)],
dtype=[('index', '<i8'), ('A', '|O'), ('B', '|O'), ('C', '|O'), ('D', '|O'), ('important_col', '<i8')])
In [71]: %timeit df_big[(df_big['A']== 'A4') & (df_big['C'] == 'CA4') & (df_big['D'] == 'D4')]['important_col'].mean()
100 loops, best of 3: 2.91 ms per loop
In [72]: %timeit df_big['important_col'][(df_big['A']== 'A4') & (df_big['C'] == 'CA4') & (df_big['D'] == 'D4')].mean()
100 loops, best of 3: 2.46 ms per loop
In [73]: df_big[(df_big['A']== 'A4') & (df_big['C'] == 'CA4') & (df_big['D'] == 'D4')]['important_col'].mean()
In [74]: %timeit dfr['important_col'][(dfr['A']== 'A4') & (dfr['C'] == 'CA4') & (dfr['D'] == 'D4')].mean()
1000 loops, best of 3: 877 µs per loop
</code></pre>
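<p>For anyone who wants to reproduce the comparison outside an IPython session, here is a minimal, self-contained sketch of the three variants timed above. The small <code>df</code> is a hypothetical stand-in (the question's actual frame is not repeated here), and <code>pd.concat</code> is used in place of the <code>append</code> loop; all three variants should produce the same mean, and only the amount of intermediate pandas machinery differs.</p>

```python
import pandas as pd

# Hypothetical stand-in for the question's `df`; the values are illustrative only.
df = pd.DataFrame({
    "A": ["A1", "A2", "A3", "A4"],
    "B": ["BA1", "BA2", "BA3", "BA4"],
    "C": ["CA1", "CA2", "CA3", "CA4"],
    "D": ["D1", "D2", "D3", "D4"],
    "important_col": [900, 900, 500, 300],
})

# Build the large frame in one step instead of appending in a loop.
df_big = pd.concat([df] * 1000, ignore_index=True)

# Boolean mask shared by the two pandas variants.
mask = (df_big["A"] == "A4") & (df_big["C"] == "CA4") & (df_big["D"] == "D4")

# 1. Filter the whole frame, then pick the column
#    (builds an intermediate DataFrame with every column).
mean_frame_first = df_big[mask]["important_col"].mean()

# 2. Pick the column first, then filter (only a Series is built).
mean_column_first = df_big["important_col"][mask].mean()

# 3. Same comparison on a NumPy record array, bypassing pandas indexing.
dfr = df_big.to_records()
np_mask = (dfr["A"] == "A4") & (dfr["C"] == "CA4") & (dfr["D"] == "D4")
mean_records = dfr["important_col"][np_mask].mean()

print(mean_frame_first, mean_column_first, mean_records)
```

<p>Wrapping each of the three expressions in <code>%timeit</code> (or the <code>timeit</code> module) on your own data is the only reliable way to see how large the difference is for your case; the numbers above come from one machine and one pandas version.</p>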