<p>There is a <a href="http://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.stats.mstats.winsorize.html" rel="noreferrer">winsorize function in scipy.stats.mstats</a> that you might consider using. Note, however, that it returns slightly different values than <code>winsorize_series</code>:</p>
<pre><code>In [126]: winsorize_series(pd.Series(range(20), dtype='float'))[0]
Out[126]: 0.95000000000000007
In [127]: mstats.winsorize(pd.Series(range(20), dtype='float'), limits=[0.05, 0.05])[0]
Out[127]: 1.0
</code></pre>
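<p>The difference most likely comes from how the cutoff is chosen. The value 0.95 is what an interpolated 5% quantile of 0..19 gives, while <code>mstats.winsorize</code> replaces the tails with the nearest retained data value. The sketch below assumes <code>winsorize_series</code> clips to interpolated quantiles; its definition is not shown here, so <code>quantile_clip</code> is a hypothetical stand-in, not the original function.</p>

```python
import numpy as np
import pandas as pd
from scipy.stats import mstats

def quantile_clip(s, lower=0.05, upper=0.95):
    # Hypothetical stand-in for winsorize_series (an assumption):
    # clip to the linearly interpolated lower/upper quantiles.
    lo, hi = s.quantile([lower, upper])
    return s.clip(lower=lo, upper=hi)

s = pd.Series(range(20), dtype='float')

# Quantile-based clipping: the cutoffs need not be values present in the data.
clipped = quantile_clip(s)

# mstats.winsorize: the tails are replaced by the nearest retained data value.
winsorized = np.asarray(mstats.winsorize(s.to_numpy(), limits=[0.05, 0.05]))

print(clipped.iloc[0], clipped.iloc[-1])
print(winsorized[0], winsorized[-1])
```

<p>With 20 points and 5% limits, <code>mstats.winsorize</code> replaces exactly one value at each end, so the two approaches differ only in those tail values.</p>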
<hr/>
<p>Using <code>mstats.winsorize</code> instead of <code>winsorize_series</code> can be (depending on N, M, P) about 1.5 times faster:</p>
<pre><code>import numpy as np
import pandas as pd
from scipy.stats import mstats

def using_mstats_df(df):
    # Winsorize each column of the group independently.
    return df.apply(using_mstats, axis=0)

def using_mstats(s):
    # Replace the lowest and highest 5% of values with the nearest retained value.
    return mstats.winsorize(s, limits=[0.05, 0.05])

N, M, P = 10**5, 10, 10**2
dates = pd.date_range('2001-01-01', periods=N//P, freq='D').repeat(P)
df = pd.DataFrame(np.random.random((N, M)), index=dates)
df.index.names = ['DATE']
grouped = df.groupby(level='DATE')
</code></pre>
<hr/>
<pre><code>In [122]: %timeit result = grouped.apply(winsorize_df)
1 loops, best of 3: 17.8 s per loop
In [123]: %timeit mstats_result = grouped.apply(using_mstats_df)
1 loops, best of 3: 11.2 s per loop
</code></pre>