<p>I have a DataFrame similar to the following:</p>
<pre><code>import pandas as pd
import numpy as np
date = pd.date_range(start='2020-01-01', freq='H', periods=4)
locations = ["AA3", "AB1", "AD1", "AC0"]
x = [5.5, 10.2, np.nan, 2.3, 11.2, np.nan, 2.1, 4.0, 6.1, np.nan, 20.3, 11.3, 4.9, 15.2, 21.3, np.nan]
df = pd.DataFrame({'x': x})
df.index = pd.MultiIndex.from_product([locations, date], names=['location', 'date'])
df = df.sort_index()
df
</code></pre>
<pre><code>                                 x
location date
AA3      2020-01-01 00:00:00   5.5
         2020-01-01 01:00:00  10.2
         2020-01-01 02:00:00   NaN
         2020-01-01 03:00:00   2.3
AB1      2020-01-01 00:00:00  11.2
         2020-01-01 01:00:00   NaN
         2020-01-01 02:00:00   2.1
         2020-01-01 03:00:00   4.0
AC0      2020-01-01 00:00:00   4.9
         2020-01-01 01:00:00  15.2
         2020-01-01 02:00:00  21.3
         2020-01-01 03:00:00   NaN
AD1      2020-01-01 00:00:00   6.1
         2020-01-01 01:00:00   NaN
         2020-01-01 02:00:00  20.3
         2020-01-01 03:00:00  11.3
</code></pre>
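<p>The index consists of a location code and an hourly timestamp. I want to fill the missing values in column <code>x</code> with the valid value of the same column from the nearest location at the same date and hour, where the proximity of each location to the others is defined as:</p>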
<pre><code>nearest = pd.DataFrame({"AA3": ["AA3", "AB1", "AD1", "AC0"],
                        "AB1": ["AB1", "AA3", "AC0", "AD1"],
                        "AD1": ["AD1", "AC0", "AB1", "AA3"],
                        "AC0": ["AC0", "AD1", "AA3", "AB1"]})
nearest
</code></pre>
<pre><code>   AA3  AB1  AD1  AC0
0  AA3  AB1  AD1  AC0
1  AB1  AA3  AC0  AD1
2  AD1  AC0  AB1  AA3
3  AC0  AD1  AA3  AB1
</code></pre>
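<p>In this DataFrame, the column names are location codes, and the rows under each column list the locations in order of proximity to the location named by the column (row 0 is the location itself, row 1 its nearest neighbour, and so on).</p>
<p>If the nearest location is also missing a value at the same date and hour, I take the value from the second-nearest location at that date and hour. If the second-nearest location is missing too, then the third-nearest at the same date and hour, and so on.</p>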
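<p>To make the fallback rule concrete, here is a minimal sketch for a single cell, using the values of <code>x</code> at 2020-01-01 01:00 from the data above and the <code>AD1</code> column of <code>nearest</code>. The helper name <code>first_valid</code> is only illustrative and is not part of pandas.</p>

```python
import math

# Values of x at one timestamp (2020-01-01 01:00), keyed by location.
values_at_hour = {"AA3": 10.2, "AB1": float("nan"), "AC0": 15.2, "AD1": float("nan")}

# Neighbour order for AD1, nearest first (the "AD1" column of `nearest`).
order_for_ad1 = ["AD1", "AC0", "AB1", "AA3"]


def first_valid(order, values):
    """Walk `order` nearest-first and return the first non-missing value."""
    for loc in order:
        v = values.get(loc, float("nan"))
        if not math.isnan(v):
            return v
    # Every listed location is missing at this timestamp.
    return float("nan")


# AD1 itself is missing, so the value falls back to AC0.
filled = first_valid(order_for_ad1, values_at_hour)
```

<p>Here <code>AD1</code> is missing, so the walk stops at <code>AC0</code>, which matches the expected output for that row.</p>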
<p>Expected output:</p>
<pre><code>                                 x
location date
AA3      2020-01-01 00:00:00   5.5
         2020-01-01 01:00:00  10.2
         2020-01-01 02:00:00   2.1
         2020-01-01 03:00:00   2.3
AB1      2020-01-01 00:00:00  11.2
         2020-01-01 01:00:00  10.2
         2020-01-01 02:00:00   2.1
         2020-01-01 03:00:00   4.0
AC0      2020-01-01 00:00:00   4.9
         2020-01-01 01:00:00  15.2
         2020-01-01 02:00:00  21.3
         2020-01-01 03:00:00  11.3
AD1      2020-01-01 00:00:00   6.1
         2020-01-01 01:00:00  15.2
         2020-01-01 02:00:00  20.3
         2020-01-01 03:00:00  11.3
</code></pre>
<p>The following, based on a suggestion by <a href="https://stackoverflow.com/users/5972189/kiona1018">@kiona1018</a>, works as expected but is slow:</p>
<pre><code>def fillna_by_nearest(x: pd.Series, nn_data: pd.DataFrame):
    out = x.copy()
    # Series.iteritems() was removed in pandas 2.0; items() is the equivalent
    for index, value in x.items():
        if np.isnan(value) and (index[0] in nn_data.columns):
            location, date = index
            # walk the neighbours nearest-first and stop at the first valid value
            for near_location in nn_data[location]:
                if ((near_location, date) in x.index) and pd.notna(x.loc[near_location, date]):
                    out.loc[index] = x.loc[near_location, date]
                    break
    return out

fillna_by_nearest(df['x'], nearest)
</code></pre>
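<p>One possible direction for speeding this up, sketched here rather than benchmarked: pivot the series to a wide table (one column per location), then for each location reorder the columns nearest-first and back-fill across them, so the Python loop runs over locations instead of over rows. The function name <code>fillna_by_nearest_wide</code> is hypothetical, and the sketch assumes the (location, date) index is unique and that the levels are named as in the example.</p>

```python
import numpy as np
import pandas as pd


def fillna_by_nearest_wide(x: pd.Series, nn_data: pd.DataFrame) -> pd.Series:
    # Rows: date, columns: location. Requires a unique (location, date) index.
    wide = x.unstack("location")
    filled = wide.copy()
    for loc in wide.columns:
        if loc not in nn_data.columns:
            continue  # no neighbour ranking for this location, leave as-is
        # Neighbours nearest-first, restricted to locations present in the data.
        order = [n for n in nn_data[loc] if n in wide.columns]
        if not order:
            continue
        # Back-filling across the reordered columns puts, in the first column,
        # the first non-missing value found when walking the neighbours.
        first_valid = wide[order].bfill(axis=1).iloc[:, 0]
        filled[loc] = wide[loc].fillna(first_valid)
    # Back to long form (a Series indexed by location, date), original row order.
    return filled.unstack().reindex(x.index).rename(x.name)


# Same data as in the question, built with an explicit one-hour step.
date = pd.date_range(start="2020-01-01", periods=4, freq=pd.Timedelta(hours=1))
locations = ["AA3", "AB1", "AD1", "AC0"]
x = [5.5, 10.2, np.nan, 2.3, 11.2, np.nan, 2.1, 4.0,
     6.1, np.nan, 20.3, 11.3, 4.9, 15.2, 21.3, np.nan]
df = pd.DataFrame({"x": x})
df.index = pd.MultiIndex.from_product([locations, date], names=["location", "date"])
df = df.sort_index()

nearest = pd.DataFrame({"AA3": ["AA3", "AB1", "AD1", "AC0"],
                        "AB1": ["AB1", "AA3", "AC0", "AD1"],
                        "AD1": ["AD1", "AC0", "AB1", "AA3"],
                        "AC0": ["AC0", "AD1", "AA3", "AB1"]})

result = fillna_by_nearest_wide(df["x"], nearest)
```

<p>As in the loop version, fills are taken from the original values only, so a value that was itself filled is never propagated further. Whether this is actually faster will depend on the number of locations relative to the number of timestamps.</p>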