Different usages of diff() produce different results. Why? What do they mean?

Posted 2024-06-25 07:19:20


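In my analysis, different usages of the diff() function produce different results. Why is that, and what do the results mean? I am not quite clear on this. Can anyone help?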

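I am analyzing my behavioral experiment in a Jupyter notebook. For each participant, I have trial-by-trial data on the apples, rice, and teak they sowed and reaped. I am trying to smooth and normalize the "teak share" (this is an agricultural simulation game; "teak share" is the difference between the teak sown and the teak reaped in each trial) and then find the differences between trials. However, when I use diff() in two different ways, I get two different results. Why is that?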

Scenario 1:

# Working out correlation for participant Parika
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

name = 'Parika'
fname = name + '.xlsx'
data = pd.read_excel(fname)
data.columns = data.columns.str.rstrip()  # strip trailing whitespace from column names

# Running total of (sown - reaped) for each crop across trials
data['apple-share'] = np.cumsum(data['Apples sown'].values - data['Apples reaped'].values)
data['rice-share'] = np.cumsum(data['Rice sown'].values - data['Rice reaped'].values)
data['teak-share'] = np.cumsum(data['Teak sown'].values - data['Teak reaped'].values)


df = ((data['teak-share'].rolling(window=25, min_periods = 1, win_type='parzen', center=True).mean() - data['teak-share'][24:].mean())/data['teak-share'][24:].std()).diff()
df.plot(kind="line")
for x in data[data['Resource Cost']>5000]['Simulation No'].values:
    plt.axvline(x, color='red', linestyle=':', linewidth=2)
    plt.xticks(np.arange(0,120, step= 24), (data['Block'][0], data['Block'][24][0], data['Block'][48][0], data['Block'][72][0], data['Block'][96][0]))

N = range(5)
cumdev = 0
for n in N:
    cumdev = cumdev + df[data[data['Resource Cost']>5000]['Simulation No'].values + n].sum()

print(cumdev)
plt.title("Smoothed")
plt.ylabel("Teak share")
plt.xlabel("Trials")
plt.show()

Here, df is computed by smoothing first, then normalizing, then taking diff(). Output: plot of teak share

Scenario 2:

# Working out correlation for participant Parika

name = 'Parika'
fname = name + '.xlsx'
data = pd.read_excel(fname)
data.columns = data.columns.str.rstrip()  # strip trailing whitespace from column names

# Running total of (sown - reaped) for each crop across trials
data['apple-share'] = np.cumsum(data['Apples sown'].values - data['Apples reaped'].values)
data['rice-share'] = np.cumsum(data['Rice sown'].values - data['Rice reaped'].values)
data['teak-share'] = np.cumsum(data['Teak sown'].values - data['Teak reaped'].values)


df = ((data['teak-share'].rolling(window=25, min_periods = 1, win_type='parzen', center=True).mean() - data['teak-share'][24:].mean())/data['teak-share'][24:].std())
df.plot(kind="line")
for x in data[data['Resource Cost']>5000]['Simulation No'].values:
    plt.axvline(x, color='red', linestyle=':', linewidth=2)
    plt.xticks(np.arange(0,120, step= 24), (data['Block'][0], data['Block'][24][0], data['Block'][48][0], data['Block'][72][0], data['Block'][96][0]))

N = range(5)
cumdev = 0
for n in N:
    cumdev = cumdev + df.diff()[data[data['Resource Cost']>5000]['Simulation No'].values + n].sum()

print(cumdev)
plt.title("Smoothed")
plt.ylabel("Teak share")
plt.xlabel("Trials")
plt.show()

Here, df is computed the same way as above, but without diff(). The diff() is applied when computing cumdev instead. Output: plot of teak share

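The red lines mark the trials in which participants faced a budget overrun. Although cumdev is the same in both cases, the plots are different. I am not clear why this happens. Can anyone help?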
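For anyone trying to reproduce this without the spreadsheet, here is a minimal sketch of the two scenarios on synthetic data. Everything specific in it is an assumption for illustration: the random `teak_share` series, the `overrun_trials` positions, and the helper `cumulative_deviation` are hypothetical, and a plain centered rolling mean stands in for the Parzen window so that SciPy is not needed. It suggests that the two versions differ only in what `df` holds at the moment `df.plot()` is called (differences in scenario 1, smoothed and normalized levels in scenario 2), while the series that feeds `cumdev` is the same in both.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for data['teak-share'] (hypothetical values, not the real experiment)
rng = np.random.default_rng(0)
teak_share = pd.Series(np.cumsum(rng.integers(-3, 4, size=120)), dtype="float64")

# Smooth, then normalize, mirroring the expression in the question.
# Assumption: a plain rolling mean replaces win_type='parzen'.
smoothed = teak_share.rolling(window=25, min_periods=1, center=True).mean()
normalized = (smoothed - teak_share.iloc[24:].mean()) / teak_share.iloc[24:].std()

df_1 = normalized.diff()  # scenario 1: df already holds trial-to-trial differences
df_2 = normalized         # scenario 2: df still holds the smoothed, normalized levels

# Hypothetical trials where 'Resource Cost' exceeded 5000
overrun_trials = np.array([30, 60, 90])

def cumulative_deviation(diffed, trials, horizon=5):
    """Sum the differenced series at each overrun trial and the next horizon - 1 trials."""
    return sum(diffed.loc[trials + n].sum() for n in range(horizon))

cumdev_1 = cumulative_deviation(df_1, overrun_trials)         # diff() applied up front
cumdev_2 = cumulative_deviation(df_2.diff(), overrun_trials)  # diff() applied at summation

print("cumdev equal:", np.isclose(cumdev_1, cumdev_2))
print("plotted series equal:", df_1.equals(df_2))
```

Under these assumptions, plotting `df_2.diff()` instead of `df_2` in the second scenario would be expected to give the same picture as the first.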

