Answer to the question: Extrapolating values in a Pandas DataFrame

Extrapolating values in a Pandas DataFrame


1 answer

Anonymous, 1 day ago

<h2>Extrapolating Pandas <code>DataFrame</code>s</h2>

A <code>DataFrame</code> can be extrapolated, but pandas has no simple built-in method call for it, so another library is needed (for example <a href="http://docs.scipy.org/doc/scipy/reference/optimize.html" rel="nofollow noreferrer">scipy.optimize</a>).

<h2>Extrapolating</h2>

In general, extrapolating requires making certain <a href="https://en.wikipedia.org/wiki/Extrapolation#Quality_of_extrapolation" rel="nofollow noreferrer">assumptions about the data</a> being extrapolated. One approach is <a href="https://en.wikipedia.org/wiki/Curve_fitting" rel="nofollow noreferrer">curve fitting</a>: fit some general parameterized equation to the data to find the parameter values that best describe the existing points, then use that equation to calculate values beyond the range of the data. The difficulty and limitation of this approach is that choosing the parameterized equation already assumes something about the trend. A suitable equation can be found by trial and error with different equations until the result looks reasonable, or it can sometimes be inferred from the source of the data. The data provided in the question is not really large enough to obtain a well-fitted curve, but it is good enough for illustration.

The following is an example of extrapolating the <code>DataFrame</code> with a 3rd-order polynomial:

<blockquote> f(x) = a x<sup>3</sup> + b x<sup>2</sup> + c x + d <a href="https://www.desmos.com/calculator/sx3wbe037u" rel="nofollow noreferrer">(Eq. 1)</a> </blockquote>

This generic function (<code>func()</code>) is curve-fitted to each column to obtain column-specific parameters (a, b, c and d). The fitted equations are then used to extrapolate the data in each column for every index whose value is <code>NaN</code>.

<pre class="lang-py prettyprint-override"><code>import pandas as pd
from io import StringIO
from scipy.optimize import curve_fit

df = pd.read_table(StringIO('''
                neg       neu       pos       avg
0               NaN       NaN       NaN       NaN
250        0.508475  0.527027  0.641292  0.558931
500             NaN       NaN       NaN       NaN
1000       0.650000  0.571429  0.653983  0.625137
2000            NaN       NaN       NaN       NaN
3000       0.619718  0.663158  0.665468  0.649448
4000            NaN       NaN       NaN       NaN
6000            NaN       NaN       NaN       NaN
8000            NaN       NaN       NaN       NaN
10000           NaN       NaN       NaN       NaN
20000           NaN       NaN       NaN       NaN
30000           NaN       NaN       NaN       NaN
50000           NaN       NaN       NaN       NaN'''), sep=r'\s+')

# Do the original interpolation (fills only gaps between known points)
df.interpolate(method='nearest', axis=0, inplace=True)

# Display result
print('Interpolated data:')
print(df)
print()

# Function to curve fit to the data
def func(x, a, b, c, d):
    return a * (x ** 3) + b * (x ** 2) + c * x + d

# Initial parameter guess, just to kick off the optimization
guess = (0.5, 0.5, 0.5, 0.5)

# Create copy of data to remove NaNs for curve fitting
fit_df = df.dropna()

# Place to store function parameters for each column
col_params = {}

# Curve fit each column
for col in fit_df.columns:
    # Get x and y
    x = fit_df.index.astype(float).values
    y = fit_df[col].values
    # Curve fit column and get curve parameters
    params = curve_fit(func, x, y, guess)
    # Store optimized parameters (params[1] is the covariance matrix)
    col_params[col] = params[0]

# Extrapolate each column
for col in df.columns:
    # Boolean mask of the rows that are still NaN in this column
    mask = df[col].isna()
    # Index values of those rows, as floats
    x = df.index[mask].astype(float).values
    # Extrapolate those points with the fitted function
    # (.loc avoids the chained assignment df[col][x] = ...)
    df.loc[mask, col] = func(x, *col_params[col])

# Display result
print('Extrapolated data:')
print(df)
print()

print('Data was extrapolated with these column functions:')
for col in col_params:
    print('f_{}(x) = {:0.3e} x^3 + {:0.3e} x^2 + {:0.4f} x + {:0.4f}'.format(col, *col_params[col]))
</code></pre>

<h2>Extrapolated results</h2>

<pre class="lang-none prettyprint-override"><code>Interpolated data:
            neg       neu       pos       avg
0           NaN       NaN       NaN       NaN
250    0.508475  0.527027  0.641292  0.558931
500    0.508475  0.527027  0.641292  0.558931
1000   0.650000  0.571429  0.653983  0.625137
2000   0.650000  0.571429  0.653983  0.625137
3000   0.619718  0.663158  0.665468  0.649448
4000        NaN       NaN       NaN       NaN
6000        NaN       NaN       NaN       NaN
8000        NaN       NaN       NaN       NaN
10000       NaN       NaN       NaN       NaN
20000       NaN       NaN       NaN       NaN
30000       NaN       NaN       NaN       NaN
50000       NaN       NaN       NaN       NaN

Extrapolated data:
               neg          neu         pos          avg
0         0.411206     0.486983    0.631233     0.509807
250       0.508475     0.527027    0.641292     0.558931
500       0.508475     0.527027    0.641292     0.558931
1000      0.650000     0.571429    0.653983     0.625137
2000      0.650000     0.571429    0.653983     0.625137
3000      0.619718     0.663158    0.665468     0.649448
4000      0.621036     0.969232    0.708464     0.766245
6000      1.197762     2.799529    0.991552     1.662954
8000      3.281869     7.191776    1.702860     4.058855
10000     7.767992    15.272849    3.041316     8.694096
20000    97.540944   150.451269   26.103320    91.365599
30000   381.559069   546.881749   94.683310   341.042883
50000  1979.646859  2686.936912  467.861511  1711.489069

Data was extrapolated with these column functions:
f_neg(x) = 1.864e-11 x^3 + -1.471e-07 x^2 + 0.0003 x + 0.4112
f_neu(x) = 2.348e-11 x^3 + -1.023e-07 x^2 + 0.0002 x + 0.4870
f_avg(x) = 1.542e-11 x^3 + -9.016e-08 x^2 + 0.0002 x + 0.5098
f_pos(x) = 4.144e-12 x^3 + -2.107e-08 x^2 + 0.0000 x + 0.6312
</code></pre>

<h2>Plot for the <code>avg</code> column</h2>

<a href="https://www.desmos.com/calculator/sx3wbe037u" rel="nofollow noreferrer"><img src="https://s3.amazonaws.com/grapher/exports/ltn4krqrf4.png" alt="Extrapolated Data"/></a>

Without a larger dataset or knowledge of where the data comes from, this result may be completely wrong: the known values all lie between roughly 0.5 and 0.67, while the fitted cubic reaches values in the thousands at x = 50000. It should nevertheless illustrate the process of extrapolating a <code>DataFrame</code>. The equation assumed in <code>func()</code> will probably need to be adjusted to get a sensible extrapolation. Also, no attempt was made to make the code efficient.

Update: If the index is non-numeric, such as a <code>DatetimeIndex</code>, <a href="https://stackoverflow.com/a/35960833/2087463">see this answer</a> for how to extrapolate it.
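
For the non-numeric index case mentioned in the update, the following is a minimal sketch of one possible approach, not necessarily the one used in the linked answer. It assumes that converting the timestamps to elapsed days is an acceptable numeric x-axis, and it uses <code>numpy.polyfit</code> with a straight line instead of <code>curve_fit</code> with the cubic, only to keep the example short. The helper name <code>extrapolate_datetime</code> and the sample data are hypothetical.

```python
import numpy as np
import pandas as pd


def extrapolate_datetime(df, degree=1):
    """Fill NaNs in each column with a polynomial fitted to the non-NaN rows.

    Hypothetical helper: the DatetimeIndex is converted to elapsed days
    (an assumption) so that it can be used as a numeric x-axis.
    """
    out = df.copy()
    # Elapsed days since the first timestamp, as a float array
    elapsed = (out.index - out.index[0]) / pd.Timedelta(days=1)
    x_all = elapsed.to_numpy(dtype=float)
    for col in out.columns:
        # Rows that still need a value in this column
        mask = out[col].isna().to_numpy()
        # Fit the polynomial to the known points only
        coeffs = np.polyfit(x_all[~mask], out[col].to_numpy()[~mask], degree)
        # Evaluate the fitted polynomial at the missing points
        out.loc[mask, col] = np.polyval(coeffs, x_all[mask])
    return out


# Made-up example data, not taken from the question
idx = pd.date_range("2020-01-01", periods=6, freq="D")
demo = pd.DataFrame({"y": [1.0, 3.0, 5.0, np.nan, np.nan, np.nan]}, index=idx)
result = extrapolate_datetime(demo)
print(result)
```

The same caveat applies as for the numeric case: the choice of <code>degree</code> is an assumption about the trend, and a higher-degree polynomial can diverge quickly outside the range of the known data.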
