<p>How about skipping <code>ngroup</code> and writing our own function to create the <code>group_id</code> column?</p>
<p>Here is a code snippet that seems to give better performance:</p>
<pre><code>from memory_profiler import memory_usage
import time
import pandas as pd
import numpy as np

# Benchmark sizes: 10^4 up to 10^8 rows
N_values = [10**k for k in range(4, 9)]
stats = pd.DataFrame(index=N_values, dtype=float, columns=['time', 'basemem', 'groupby_mem'])

for N in N_values:
    # Two binary columns (male, edu) plus a normally distributed wage column
    df = pd.DataFrame(
        np.hstack([np.random.randint(0, 2, (N, 2)), np.random.normal(5, 1, (N, 1))]),
        columns=['male', 'edu', 'wage']
    )

    def groupby_ngroup():
        #df.groupby(['male', 'edu']).ngroup()
        df['group_id'] = 2*df.male + df.edu

    def foo():
        pass

    # Baseline memory: peak usage while running a no-op
    basemem = max(memory_usage(proc=foo))

    # Peak memory and wall-clock time while building group_id
    tic = time.time()
    mem = max(memory_usage(proc=groupby_ngroup))
    toc = time.time() - tic

    stats.loc[N, 'basemem'] = basemem
    stats.loc[N, 'groupby_mem'] = mem
    stats.loc[N, 'time'] = toc

stats['mem_ratio'] = stats.eval('groupby_mem/basemem')
stats
               time      basemem  groupby_mem  mem_ratio
10000      0.117921  2370.792969    79.761719   0.033643
100000     0.026921    84.265625    84.324219   1.000695
1000000    0.067960   130.101562   130.101562   1.000000
10000000   0.220024   308.378906   536.140625   1.738577
100000000  0.751135  2367.187500  3651.171875   1.542409
</code></pre>
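<p>Essentially, we exploit the fact that the columns are numeric and treat them as the digits of a binary number. The <code>group_id</code> is then the decimal equivalent of that number.</p>
<p>Scaling this up to three columns gives similar results. To try it, replace the DataFrame initialization with the following:</p>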
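<p>As a quick sanity check of that idea, the sketch below compares the hand-rolled id with <code>ngroup</code> on a tiny frame. The six sample rows are made up for illustration. The two ids agree here because every <code>(male, edu)</code> combination occurs; if a combination were missing, <code>ngroup</code> would still number the groups consecutively while the binary encoding would leave a gap.</p>

```python
import pandas as pd

# Illustrative data (an assumption, not from the benchmark above);
# all four (male, edu) combinations are present.
df = pd.DataFrame({
    'male': [0, 0, 1, 1, 0, 1],
    'edu':  [0, 1, 0, 1, 1, 0],
})

# Hand-rolled id: read (male, edu) as a two-digit binary number.
manual_id = 2 * df['male'] + df['edu']

# Reference id from pandas; groups are numbered in sorted key order.
ngroup_id = df.groupby(['male', 'edu']).ngroup()

same = bool((manual_id == ngroup_id).all())
print(manual_id.tolist())
print(same)
```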
<pre><code>df = pd.DataFrame(
    np.hstack([np.random.randint(0, 2, (N, 3)), np.random.normal(5, 1, (N, 1))]),
    columns=['male', 'edu', 'random1', 'wage']
)
</code></pre>
<p>And the group id function with this one:</p>
<pre><code>def groupby_ngroup():
    df['group_id'] = 4*df.male + 2*df.edu + df.random1
</code></pre>
<p>The test results are as follows:</p>
<pre><code>               time      basemem  groupby_mem  mem_ratio
10000      0.050006    78.906250    78.980469   1.000941
100000     0.033699    85.007812    86.339844   1.015670
1000000    0.066184   147.378906   147.378906   1.000000
10000000   0.322198   422.039062   691.179688   1.637715
100000000  1.233054  3167.921875  5183.183594   1.636146
</code></pre>
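<p>If you would rather not write the weights by hand for each new column, the same encoding could be generalized along these lines. This is only a sketch: <code>binary_group_id</code> is a hypothetical helper name, it assumes every listed column contains only 0 and 1, and it has not been run through the benchmark above, so its time and memory behaviour may differ (the conversion to a NumPy array makes a copy).</p>

```python
import numpy as np
import pandas as pd

def binary_group_id(df, cols):
    """Hypothetical helper: encode 0/1 columns as a single integer id.

    The first column in `cols` is treated as the most significant bit.
    """
    # Powers of two in descending order, e.g. [4, 2, 1] for three columns.
    weights = 2 ** np.arange(len(cols) - 1, -1, -1)
    bits = df[cols].to_numpy().astype(np.int64)
    return pd.Series(bits @ weights, index=df.index)

# Illustrative data (an assumption, not the benchmark frame).
df = pd.DataFrame({
    'male':    [0, 1, 1, 0],
    'edu':     [1, 0, 1, 0],
    'random1': [1, 1, 0, 0],
})
df['group_id'] = binary_group_id(df, ['male', 'edu', 'random1'])
print(df['group_id'].tolist())
```

<p>For three columns this reproduces <code>4*df.male + 2*df.edu + df.random1</code>.</p>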