<p>A good way to solve this is vectorization. For that, I like to use <code>np.where</code>.</p>
<pre><code>import pandas as pd
import numpy as np
from scipy.stats import mstats
import timeit

data = pd.Series(range(20), dtype='float')

def WinsorizeCustom(data):
    # Compute the 5th and 95th percentiles in a single call
    quantiles = data.quantile([0.05, 0.95])
    q_05 = quantiles.loc[0.05]
    q_95 = quantiles.loc[0.95]
    # Raise values at or below q_05 to q_05, lower values at or above q_95 to q_95
    out = np.where(data.values <= q_05, q_05,
                   np.where(data.values >= q_95, q_95, data.values))
    return out
</code></pre>
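<p>The nested <code>np.where</code> calls clamp the values between two bounds, which <code>np.clip</code> can also do in a single call. Here is a minimal sketch of that variant, assuming the same 5%/95% quantile cutoffs; the name <code>winsorize_clip</code> is hypothetical and not part of any library.</p>

```python
import numpy as np
import pandas as pd


def winsorize_clip(data, lower=0.05, upper=0.95):
    """Hypothetical helper: clamp a Series to its [lower, upper] quantile range."""
    # Series.quantile with a list returns a Series; unpack its two values
    q_low, q_high = data.quantile([lower, upper])
    # np.clip limits every element to the interval [q_low, q_high]
    return np.clip(data.to_numpy(), q_low, q_high)


data = pd.Series(range(20), dtype='float')
clipped = winsorize_clip(data)
print(clipped)
```

<p>This should give the same result as the <code>np.where</code> version; whether it is any faster would need to be measured.</p>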
<p>For comparison, I wrapped the <code>scipy</code> function in a function of my own:</p>
<pre><code>def WinsorizeStats(data):
    # limits gives the fraction to cut from the lower and upper ends
    out = mstats.winsorize(data, limits=[0.05, 0.05])
    return out
</code></pre>
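<p>One caveat when comparing the two approaches: they do not necessarily return identical values. Clamping to interpolated quantiles can produce cutoffs that are not in the data, while <code>mstats.winsorize</code> replaces the extreme points with the nearest remaining data value. The sketch below illustrates this on the same 20-point input; the variable names are illustrative only.</p>

```python
import numpy as np
import pandas as pd
from scipy.stats import mstats

values = np.arange(20, dtype=float)

# Quantile-based clamping: the cutoffs are linearly interpolated quantiles
q_05, q_95 = pd.Series(values).quantile([0.05, 0.95])
custom = np.clip(values, q_05, q_95)

# SciPy replaces the lowest and highest 5% of points with the nearest kept value
scipy_out = np.asarray(mstats.winsorize(values, limits=[0.05, 0.05]))

# Compare the first and last elements, where the two methods differ
print("quantile clamp:", custom[[0, -1]])
print("mstats.winsorize:", scipy_out[[0, -1]])
```

<p>Which behaviour is appropriate depends on whether the replaced values need to be actual observations.</p>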
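<p>But as you can see, although my function is reasonably fast, it is still about four times slower than the SciPy implementation on this input (842 µs versus 212 µs):</p>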
<pre><code>%timeit WinsorizeCustom(data)
#1000 loops, best of 3: 842 µs per loop
%timeit WinsorizeStats(data)
#1000 loops, best of 3: 212 µs per loop
</code></pre>
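<p><code>%timeit</code> is an IPython magic, so it only works in IPython or Jupyter. For a plain Python script, a rough equivalent can be sketched with the standard <code>timeit</code> module. The function and helper names below (<code>winsorize_custom</code>, <code>winsorize_stats</code>, <code>best_time_us</code>) are illustrative, and the timings will vary with machine, library versions, and input size.</p>

```python
import timeit

import numpy as np
import pandas as pd
from scipy.stats import mstats

data = pd.Series(range(20), dtype='float')


def winsorize_custom(data):
    # Clamp to the interpolated 5% and 95% quantiles with nested np.where
    q_05, q_95 = data.quantile([0.05, 0.95])
    values = data.to_numpy()
    return np.where(values <= q_05, q_05,
                    np.where(values >= q_95, q_95, values))


def winsorize_stats(data):
    # SciPy's winsorize, cutting 5% from each end
    return mstats.winsorize(data.to_numpy(), limits=[0.05, 0.05])


def best_time_us(func, number=200, repeat=3):
    # Best average time per call across the repeats, in microseconds
    timings = timeit.repeat(lambda: func(data), number=number, repeat=repeat)
    return min(timings) / number * 1e6


custom_us = best_time_us(winsorize_custom)
stats_us = best_time_us(winsorize_stats)
print(f"custom: {custom_us:.1f} µs per call")
print(f"scipy:  {stats_us:.1f} µs per call")
```

<p>With only 20 elements, fixed per-call overhead is likely to dominate, so it is worth repeating the measurement on data of a realistic size before drawing conclusions.</p>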
<p>If you are interested in reading more about speeding up pandas code, I recommend <a href="https://engineering.upside.com/a-beginners-guide-to-optimizing-pandas-code-for-speed-c09ef2c6a4d6" rel="nofollow noreferrer">Optimization Pandas for speed</a> and <a href="https://www.labri.fr/perso/nrougier/from-python-to-numpy/" rel="nofollow noreferrer">From Python to Numpy</a>.</p>