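<p>@EHB's solution above is helpful, but it is incorrect. Specifically, the rolling median computed in <em>median_abs_deviation</em> is taken over <em>difference</em>, which is itself the difference between each data point and the rolling median computed in <em>rolling_median</em>. It should instead be the median of the differences between the data in the rolling window and the median over that window. I modified the code above accordingly:</p>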
<pre><code>import numpy as np


def hampel(vals_orig, k=7, t0=3):
    '''
    vals_orig: pandas series of values from which to remove outliers
    k: size of window (including the sample; 7 is equal to 3 on either side of value)
    t0: number of scaled MADs a value may deviate from the window median
    '''
    # Make copy so original not edited
    vals = vals_orig.copy()

    # Hampel filter
    L = 1.4826  # scales the MAD to estimate the standard deviation of Gaussian data
    rolling_median = vals.rolling(window=k, center=True).median()
    # MAD of one window: median of absolute deviations from that window's own median
    MAD = lambda x: np.median(np.abs(x - np.median(x)))
    rolling_MAD = vals.rolling(window=k, center=True).apply(MAD, raw=True)
    threshold = t0 * L * rolling_MAD
    difference = np.abs(vals - rolling_median)

    # Perhaps a condition should be added here in the case that the threshold value
    # is 0.0; maybe do not mark as outlier. MAD may be 0.0 without the original values
    # being equal. See differences between MAD vs SDV.
    outlier_idx = difference > threshold
    vals[outlier_idx] = np.nan
    return vals
</code></pre>
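<p>The comment inside the function raises the case where the threshold is 0.0. As a rough sketch of one way to handle it (not part of the code above), the variant below leaves a point untouched when the MAD of its window is zero. The name <em>hampel_guarded</em> and the <em>skip_zero_mad</em> flag are hypothetical and only used for illustration; whether skipping such points is the right rule depends on the data.</p>

```python
import numpy as np
import pandas as pd


def hampel_guarded(vals_orig, k=7, t0=3, skip_zero_mad=True):
    """Hampel filter that can skip windows whose MAD is exactly 0.

    Hypothetical variant for illustration: with skip_zero_mad=False it
    applies the same rule as hampel() above.
    """
    vals = vals_orig.copy()  # leave the caller's series untouched
    L = 1.4826
    rolling_median = vals.rolling(window=k, center=True).median()
    mad = lambda x: np.median(np.abs(x - np.median(x)))
    rolling_mad = vals.rolling(window=k, center=True).apply(mad, raw=True)
    threshold = t0 * L * rolling_mad
    difference = np.abs(vals - rolling_median)

    outlier_idx = difference > threshold
    if skip_zero_mad:
        # Assumption: a zero MAD means "no usable spread estimate",
        # so the point is kept rather than flagged.
        outlier_idx &= rolling_mad > 0
    vals[outlier_idx] = np.nan
    return vals


# A noisy series with one spike, and a nearly constant series.
noisy = pd.Series([1.0, 1.1, 0.9, 1.0, 10.0, 1.05, 0.95, 1.0, 1.1, 0.9, 1.0])
flat = pd.Series([5.0, 5.0, 5.0, 5.0, 5.1, 5.0, 5.0, 5.0, 5.0])

print(hampel_guarded(noisy))                      # spike at position 4 becomes NaN
print(hampel_guarded(flat))                       # 5.1 is kept (window MAD is 0)
print(hampel_guarded(flat, skip_zero_mad=False))  # 5.1 becomes NaN (threshold is 0)
```

<p>In the <em>flat</em> series more than half of each full window equals 5.0, so the MAD is 0.0 and any deviation, however small, exceeds the threshold unless the guard is applied. Note that with <em>center=True</em> and the default <em>min_periods</em>, the first and last <em>k // 2</em> points have no full window and are never flagged.</p>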