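<p><strong>Important note:</strong> Because this answer had already grown quite long, I decided to rewrite it completely rather than update it a fifth time. If you are interested in the "historical context", have a look at the version history.</p>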
<hr/>
<p>First, run the required imports:</p>
<pre><code>import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from matplotlib import gridspec
import matplotlib as mpl
mpl.style.use('seaborn-paper') ## for nicer looking plots only; matplotlib >= 3.6 names this style 'seaborn-v0_8-paper'
from lmfit import fit_report
from lmfit.models import GaussianModel, BreitWignerModel
</code></pre>
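<p>Then clean up the data (saved as a .csv file, as described above). The code below expects the result in a DataFrame <code>df</code> that is indexed by date and has a <code>date</code> and a <code>Values</code> column.</p>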
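<p>A minimal sketch of what such a cleaning step could look like. The column names <code>date</code> and <code>Values</code> are taken from the code further down. The inline sample data and the name <code>raw_csv</code> are made up for illustration, and the actual cleaning rules depend on your file. The sketch parses the dates, drops unusable and duplicated rows, and uses the dates as a sorted, unnamed index while keeping the <code>date</code> column:</p>

```python
import io
import pandas as pd

# Hypothetical sample standing in for the real .csv file
raw_csv = io.StringIO(
    "date,Values\n"
    "2019-01-03,1.2\n"
    "2019-01-01,0.9\n"
    "2019-01-06,\n"
    "2019-01-03,1.2\n"
    "2019-01-08,2.4\n"
)

df = pd.read_csv(raw_csv, parse_dates=['date'])
df = df.dropna(subset=['date', 'Values'])   # drop rows without a usable date or value
df = df.drop_duplicates(subset='date')      # keep one row per date
# use the dates as index but keep the 'date' column; the index is left unnamed
# so that a later reset_index() creates a column called 'index'
df = df.set_index('date', drop=False).rename_axis(None).sort_index()
print(df)
```

<p>Leaving the index unnamed is a deliberate choice here: the later steps refer to both <code>df['date']</code> and <code>df['index']</code>.</p>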
<p>And reindex it to a daily frequency:</p>
<pre><code>complete_date_range_idx = pd.date_range(df.index.min(), df.index.max(),freq='D')
df_filled=df.reindex(complete_date_range_idx, fill_value=np.nan).reset_index()
## obtain index values, which can be understood as time delta in days from the start
idx=df_filled.index.values ## this will be used again, in the end
## now we obtain (x,y) on basis of idx
not_na=pd.notna(df_filled['Values'])
x=idx[not_na] ## len: 176
y=df_filled['Values'][not_na].values
### let's write over the original df
df=df_filled
#####
</code></pre>
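<p>To see what the reindexing does, here is a small self-contained sketch on made-up data: three observations spread over five days. The names <code>small</code>, <code>full_range</code> and <code>filled</code> are illustrative only. Days without a measurement become <code>NaN</code> rows, and the integer row index doubles as "days since the first observation":</p>

```python
import numpy as np
import pandas as pd

# Made-up observations with gaps on Jan 2 and Jan 4
small = pd.DataFrame(
    {'Values': [1.0, 3.0, 2.0]},
    index=pd.to_datetime(['2019-01-01', '2019-01-03', '2019-01-05']),
)

full_range = pd.date_range(small.index.min(), small.index.max(), freq='D')
filled = small.reindex(full_range).reset_index()  # missing days become NaN rows

idx = filled.index.values                   # 0..4: day offsets from the start
not_na = pd.notna(filled['Values']).values
x = idx[not_na]                             # day offsets that carry a measurement
y = filled['Values'].values[not_na]         # the corresponding measurements

print(filled)
print(x, y)
```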
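<p>Now the fun part: fit the data with an asymmetric line shape (Breit-Wigner-Fano) and remove the "outliers" that lie more than a certain threshold below the curve. We first do this against a manually declared peak position (our initial guess, which lets us remove 3 points), then do it again using the resulting fit (fit 1) as the reference (removing 8 more points), and finally obtain our final fit.</p>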
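<p>The two-pass idea can be sketched without lmfit. The example below is only an illustration under stated assumptions: it uses synthetic data and swaps the Breit-Wigner-Fano model for a plain quadratic fitted with <code>np.polyfit</code>. The helper <code>remove_low_outliers</code> and the thresholds are made up for this sketch. Note that only points <em>below</em> the reference curve are dropped:</p>

```python
import numpy as np

# Synthetic data: a parabola with a small wiggle and three artificial low outliers
x = np.arange(50, dtype=float)
y = 5.0 - 0.008 * (x - 25.0) ** 2 + 0.05 * np.sin(x)
y[[5, 20, 35]] -= 2.0

def remove_low_outliers(x, y, reference, threshold):
    """Keep only points that lie at most `threshold` below the reference curve."""
    keep = (reference - y) <= threshold
    return x[keep], y[keep]

# pass 1: compare against a rough hand-made guess, using a generous threshold
guess = 5.0 - 0.01 * (x - 25.0) ** 2
x1, y1 = remove_low_outliers(x, y, guess, threshold=1.5)

# pass 2: fit the cleaned data and use that fit as a tighter reference
fit1 = np.polyval(np.polyfit(x1, y1, deg=2), x1)
x2, y2 = remove_low_outliers(x1, y1, fit1, threshold=0.8)

# final fit on what is left
coeffs_final = np.polyfit(x2, y2, deg=2)

print(len(x) - len(x1), "points removed in pass 1")
print(len(x1) - len(x2), "points removed in pass 2")
```

<p>In this toy case the rough guess is too inaccurate near the edge to catch every outlier, which is why the second pass against an actual fit is useful.</p>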
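<p>As requested, we can now evaluate the fit on the daily index created earlier (<code>bwf_result_final.eval(x=idx)</code>) and create additional columns in the dataframe: <code>y_fine</code>, which holds only the fit; <code>y_final</code>, which holds the final point cloud (i.e. after outlier removal); and <code>y_joined</code>, a joined dataset that uses the data where available and the fit elsewhere (and therefore looks "jagged").
Finally, we can plot all of this against the "fine" date range (<code>df['index']</code>).</p>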
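<p>How the three columns relate can be shown on a tiny made-up frame. In this sketch, <code>fitted_curve</code> is a hypothetical stand-in for <code>bwf_result_final.eval</code> (here just a straight line), and <code>df_demo</code> is illustrative data, not the real dataset:</p>

```python
import numpy as np
import pandas as pd

# Made-up daily frame: 7 days, with points surviving on days 0, 3 and 6 only
df_demo = pd.DataFrame({'y_final': [1.0, np.nan, np.nan, 4.2, np.nan, np.nan, 7.0]})
idx = df_demo.index.values

def fitted_curve(x):
    """Stand-in for bwf_result_final.eval(x=...): here simply a straight line."""
    return 1.0 + 1.0 * x

df_demo['y_fine'] = fitted_curve(idx)  # the fit, evaluated on every day
# data where available, fit everywhere else
df_demo['y_joined'] = df_demo['y_final'].fillna(df_demo['y_fine'])
print(df_demo)
```

<p>Day 3 keeps its measured value (4.2) instead of the fitted one (4.0), which is where the "jagged" look of the joined series comes from.</p>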
<p><a href="https://i.stack.imgur.com/Jfo7N.png" rel="nofollow noreferrer"><img src="https://i.stack.imgur.com/Jfo7N.png" alt="Figure 1: iteratively removing outliers"/></a></p>
<p><a href="https://i.stack.imgur.com/TeMD5.png" rel="nofollow noreferrer"><img src="https://i.stack.imgur.com/TeMD5.png" alt="Figure 2: cleaned up dataset"/></a></p>
<pre><code># choose an asymmetric line shape (Fano resonance)
bwf_model = BreitWignerModel()
# make initial guesses:
params = bwf_model.make_params(center=75, amplitude=0.2, sigma=20, q=1/0.2)
# plot initial guess and fit result
bwf_result = bwf_model.fit(y, params, x=x)
#### create first figure
fig=plt.figure(figsize=(8,3),frameon=True,)
gs1 = gridspec.GridSpec(1,3,
                        left=0.08,right=0.95,
                        bottom=0.15,top=0.9,
                        wspace=0.1
                        )
a1=plt.subplot(gs1[0])
a2=plt.subplot(gs1[1])
a3=plt.subplot(gs1[2])
# first subplot
a1.set_title('Outliers from 1st guess')
## show initial x,y
a1.scatter(x,y,facecolors='None',edgecolors='b',marker='o',linewidth=1,zorder=3)
# outliers=np.argwhere(np.abs(y-bwf_result.init_fit)>1.9) ## if you want to exclude points both above and below
outliers=np.argwhere(( bwf_result.init_fit -y ) >1.9)
# remove outliers from point cloud
x_new=np.delete(x,outliers)
y_new=np.delete(y,outliers)
#### run a fit on the "cleaned" dataset
bwf_result_mod = bwf_model.fit(y_new, params, x=x_new)
a1.plot(x, bwf_result.init_fit, 'r--',label='initial guess')
a1.fill_between(x, bwf_result.init_fit, bwf_result.init_fit-1.9, color='r', hatch='///',alpha=0.2,zorder=1,label=u'guess - 1.9')
a1.scatter(x[outliers],y[outliers],c='r',marker='x',s=10**2,linewidth=1,zorder=4,label='outliers') ## show outliers
a1.plot(x_new, bwf_result_mod.best_fit, color='g',label='fit 1')
pointsRemoved=len(y)-len(y_new)
a1.text(1.05,0.5,u'↓{0} points removed'.format(pointsRemoved),ha='center',va='center',rotation=90,transform=a1.transAxes)
# second plot
a2.set_title('Outliers from 1st fit')
## show initial x,y
a2.scatter(x,y,facecolors='None',edgecolors='grey',marker='o',linewidth=.5,zorder=0,label='original data')
a2.scatter(x_new,y_new,facecolors='None',edgecolors='b',marker='o',linewidth=1,zorder=3)
a2.plot(x_new, bwf_result_mod.best_fit, color='g',label='fit 1')
# new_outliers=np.argwhere(np.abs(bwf_result_mod.residual)>0.8) ## if you want to exclude points both above and below
new_outliers=np.argwhere( bwf_result_mod.residual >0.8)
x_new_2=np.delete(x_new,new_outliers)
y_new_2=np.delete(y_new,new_outliers)
a2.scatter(x_new[new_outliers],y_new[new_outliers],c='r',marker='x',s=10**2,linewidth=1,zorder=4,label='new outliers')
a2.fill_between(x_new, bwf_result_mod.best_fit, bwf_result_mod.best_fit-0.8, color='r', hatch='///',alpha=0.2,zorder=1,label=u'fit - 0.8')
pointsRemoved=len(y_new)-len(y_new_2)
a2.text(1.05,0.5,u'↓{0} points removed'.format(pointsRemoved),ha='center',va='center',rotation=90,transform=a2.transAxes)
# third plot
_orig=len(y)
_remo=(len(y)-len(y_new_2))
_pct=_remo/(_orig/100.)
a3.set_title(u'Result ({0} of {1} removed, ~{2:.0f}%)'.format(_remo,_orig,_pct )) ## removed count first, then total
x_final=np.delete(x_new,new_outliers)
y_final=np.delete(y_new,new_outliers)
## store final point cloud in the df
df.loc[x_final,'y_final']=y_final
a3.scatter(x_final,y_final,facecolors='None',edgecolors='b',marker='o',linewidth=1,zorder=3)
## make final fit:
bwf_result_final = bwf_model.fit(y_final, params, x=x_final)
a3.scatter(x,y,facecolors='None',edgecolors='grey',marker='o',linewidth=.5,zorder=0,label='original data')
a3.plot(x_final, bwf_result_final.best_fit, color='g',label='fit 2')
## now that we are "happy" with bwf_result_final, let's apply it on the df's "fine" (i.e. daily) index!
y_fine=bwf_result_final.eval(x=idx)
##
df['y_fine']=y_fine # store fit function
## join: final point cloud where available, fit function everywhere else (avoids chained assignment)
df['y_joined']=df['y_final'].fillna(df['y_fine'])
#### create second figure
fig2=plt.figure(figsize=(8,3),frameon=True,)
gs2 = gridspec.GridSpec(1,1,
                        left=0.08,right=0.95,
                        bottom=0.15,top=0.9,
                        wspace=0.1
                        )
ax2=plt.subplot(gs2[0])
ax2.scatter(df['date'],df['Values'],facecolors='None',edgecolors='grey',marker='o',linewidth=1,zorder=0,label='original data')
ax2.plot(df['index'],df['y_fine'],c="g",zorder=3,label="final fit applied to all dates")
ax2.plot(df['index'],df['y_joined'],color="r",marker=".",markersize=6,zorder=2,label="(points-outliers) +fit ")
# print(df.head(30))
for a in [a1,a2,a3,ax2]:
    a.set_ylim(-.5,7)
    a.legend()
a1.set_ylabel('Value')
ax2.set_ylabel('Value')
for a in [a2,a3]:
    plt.setp(a.get_yticklabels(),visible=False)
for a in [a1,a2,a3,ax2]:
    a.set_xlabel('Days from start')
fig.savefig('outlier_removal.pdf')
fig2.savefig('final_data.pdf')
plt.show()
</code></pre>
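<p>To get a feeling for the initial guess (<code>center=75, amplitude=0.2, sigma=20, q=1/0.2</code>), it can help to write the line shape out by hand. The sketch below is a plain-NumPy version of the Breit-Wigner-Fano formula as I understand it from the lmfit documentation for <code>BreitWignerModel</code>; please check it against your installed version. The function name <code>breit_wigner_fano</code> and the range of day offsets are my own choices for illustration:</p>

```python
import numpy as np

def breit_wigner_fano(x, amplitude, center, sigma, q):
    """Breit-Wigner-Fano line shape, written out in plain NumPy."""
    half_width = sigma / 2.0
    return (amplitude * (q * half_width + x - center) ** 2
            / (half_width ** 2 + (x - center) ** 2))

# the initial guess used in the code above
guess = dict(amplitude=0.2, center=75, sigma=20, q=1 / 0.2)

days = np.arange(0.0, 151.0)   # arbitrary range of day offsets for illustration
curve = breit_wigner_fano(days, **guess)

# location and height of the maximum
print(days[np.argmax(curve)], curve.max())
# the curve is asymmetric: compare 30 days before and after the center
print(breit_wigner_fano(45.0, **guess), breit_wigner_fano(105.0, **guess))
```

<p>With these parameters the maximum sits slightly to the right of <code>center</code>, and the curve falls off much more steeply on the left than on the right. That asymmetry is the reason for choosing this model over a symmetric one such as a Gaussian.</p>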