<p>Using a native pandas method on <code>df.groupby</code> should give a significant performance boost over a "native Python" loop:</p>
<pre class="lang-py prettyprint-override"><code>df['time_elapsed'] = df.groupby('carId')['refill_date'].diff()
</code></pre>
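<p>To see what that one-liner computes, here is a minimal sketch on a toy frame (the five rows below are made up for illustration, not taken from the question). It assumes the rows of each car are already in date order, since <code>diff()</code> works in row order within each group. The first refill of every car gets <code>NaT</code>, and differences never cross from one car to another:</p>

```python
import pandas as pd

# Hypothetical toy data: two cars with interleaved rows,
# already in chronological order within each car.
df = pd.DataFrame({
    "carId": [1, 2, 1, 2, 1],
    "refill_date": pd.to_datetime(
        ["2020-03-01", "2020-03-02", "2020-03-05", "2020-03-03", "2020-03-12"]
    ),
})

# diff() is applied separately inside each carId group.
df["time_elapsed"] = df.groupby("carId")["refill_date"].diff()

# car 1: NaT, 4 days, 7 days
# car 2: NaT, 1 day
print(df)
```

<p>If the data is not sorted, sorting by <code>carId</code> and <code>refill_date</code> first would probably be needed to get meaningful intervals.</p>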
<p>Here is a small benchmark (on my laptop, YMMV...) using 99 cars (<code>range(1, 100)</code>) with 31 days each,
which shows almost a 10x speed-up:</p>
<pre class="lang-py prettyprint-override"><code>import timeit

import pandas as pd

# Build 99 cars (carId 1..99), each with one refill per day in March 2020
data = [{"carId": carId, "refill_date": "2020-3-" + str(day)}
        for carId in range(1, 100) for day in range(1, 32)]
df = pd.DataFrame(data)
df['refill_date'] = pd.to_datetime(df['refill_date'])

def original_method():
    # Python-level loop: one boolean mask and one diff() per car
    for c in df['carId'].unique():
        df.loc[df['carId'] == c, 'time_elapsed'] = df.loc[df['carId'] == c,
                                                          'refill_date'].diff()

def using_groupby():
    # Single vectorized call: diff() is computed within each carId group
    df['time_elapsed'] = df.groupby('carId')['refill_date'].diff()

time1 = timeit.timeit('original_method()', globals=globals(), number=100)
time2 = timeit.timeit('using_groupby()', globals=globals(), number=100)
print(time1)          # loop version, total seconds for 100 runs
print(time2)          # groupby version, total seconds for 100 runs
print(time1 / time2)  # speed-up factor
</code></pre>
<p>Output (total seconds for the loop, total seconds for <code>groupby</code>, and their ratio):</p>
<pre><code>16.6183732
1.7910263000000022
9.278687420726307
</code></pre>
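<p>A speed-up is only useful if both approaches give the same answer, so here is a rough sanity check on the same benchmark data. It is a sketch rather than part of the original timing: the names <code>pieces</code>, <code>loop_result</code>, <code>groupby_result</code> and <code>same</code> are introduced here for illustration, and the per-car results are collected with <code>pd.concat</code> instead of being written back with <code>df.loc</code>:</p>

```python
import pandas as pd

# Same data as the benchmark: carId 1..99, one refill per day in March 2020.
data = [{"carId": car_id, "refill_date": "2020-3-" + str(day)}
        for car_id in range(1, 100) for day in range(1, 32)]
df = pd.DataFrame(data)
df["refill_date"] = pd.to_datetime(df["refill_date"])

# Loop approach: one diff() per car, then stitch the pieces back together.
pieces = [df.loc[df["carId"] == c, "refill_date"].diff()
          for c in df["carId"].unique()]
loop_result = pd.concat(pieces).sort_index()

# Vectorized approach from the answer.
groupby_result = df.groupby("carId")["refill_date"].diff()

# Series.equals treats missing values in matching positions as equal.
same = loop_result.equals(groupby_result)
print(same)
```

<p>Each car contributes exactly one <code>NaT</code> (its first refill), so counting missing values is another quick way to confirm that the grouping behaved as expected.</p>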