Fill NaN rows in a datetime-indexed DataFrame using values from other non-NaN rows

Posted 2024-06-26 00:18:35


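I have a large weather DataFrame loaded from a CSV file, with several hundred thousand rows and many columns. The rows form a time series sampled every 10 minutes over many years. The index column holds the datetime, made up of year, month, day, hour, minute and second. Unfortunately, several thousand rows contain only NaN. The goal is to fill these rows with the values of other rows collected at the same date and time in other years, provided those rows are not NaN.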

I wrote a solution based on a Python for loop, but it is very slow. I need your help to find a faster, more efficient solution.

The original DataFrame looks like this:

print(df)
                    p (mbar)  T (degC)  Tpot (K)  Tdew (degC)  rh (%)
datetime                                                              
2004-01-01 00:10:00    996.52     -8.02    265.40        -8.90   93.30
2004-01-01 00:20:00    996.57     -8.41    265.01        -9.28   93.40
2004-01-01 00:40:00    996.51     -8.31    265.12        -9.07   94.20
2004-01-01 00:50:00    996.51     -8.27    265.15        -9.04   94.10
2004-01-01 01:00:00    996.53     -8.51    264.91        -9.31   93.90
...                       ...       ...       ...          ...     ...
2020-12-31 23:20:00   1000.07     -4.05    269.10        -8.13   73.10
2020-12-31 23:30:00    999.93     -3.35    269.81        -8.06   69.71
2020-12-31 23:40:00    999.82     -3.16    270.01        -8.21   67.91
2020-12-31 23:50:00    999.81     -4.23    268.94        -8.53   71.80
2021-01-01 00:00:00    999.82     -4.82    268.36        -8.42   75.70

[820551 rows x 5 columns]

For whatever reason, rows are missing from the df DataFrame. To identify them, the following function can be applied:

findnanrows(df.groupby(pd.Grouper(freq='10T')).mean())
                     p (mbar)  T (degC)  Tpot (K)  Tdew (degC)  rh (%)
datetime
2004-01-01 00:30:00       NaN       NaN       NaN          NaN     NaN
2009-10-08 09:50:00       NaN       NaN       NaN          NaN     NaN
2009-10-08 10:00:00       NaN       NaN       NaN          NaN     NaN
2013-05-16 09:00:00       NaN       NaN       NaN          NaN     NaN
2014-07-30 08:10:00       NaN       NaN       NaN          NaN     NaN
...                       ...       ...       ...          ...     ...
2016-10-28 12:00:00       NaN       NaN       NaN          NaN     NaN
2016-10-28 12:10:00       NaN       NaN       NaN          NaN     NaN
2016-10-28 12:20:00       NaN       NaN       NaN          NaN     NaN
2016-10-28 12:30:00       NaN       NaN       NaN          NaN     NaN
2016-10-28 12:40:00       NaN       NaN       NaN          NaN     NaN

[5440 rows x 5 columns]
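
As a side note, the missing timestamps can also be listed without computing a grouped mean. The following is a minimal sketch, assuming the index is unique and already aligned to a 10-minute grid; the frame `demo` and the names `full_index`, `missing` and `regular` are made up for illustration and are not part of the original data.

```python
import pandas as pd

# Small illustrative frame (hypothetical data): 00:30 is absent.
idx = pd.to_datetime(["2004-01-01 00:10", "2004-01-01 00:20", "2004-01-01 00:40"])
demo = pd.DataFrame({"T (degC)": [-8.02, -8.41, -8.31]}, index=idx)

# Complete 10-minute grid between the first and last observation.
full_index = pd.date_range(demo.index.min(), demo.index.max(), freq="10min")

# Timestamps of the grid that do not appear in the data.
missing = full_index.difference(demo.index)

# Reindexing inserts an all-NaN row at every missing timestamp.
regular = demo.reindex(full_index)
```

Here `missing` holds only the absent timestamps, while `regular` is the same data on a gap-free index, which is a convenient starting point for any fill strategy.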

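The goal is to fill all of these rows. For example, the first NaN row, at datetime 2004-01-01 00:30:00, should be filled with the non-NaN values of a row collected at the same datetime xxxx-01-01 00:30:00 in another year, such as 2005-01-01 00:30:00, 2006-01-01 00:30:00 and so on, or even 2003-01-01 00:30:00 or 2002-01-01 00:30:00 if they exist. Taking the average over all other years is also acceptable.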
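
The averaging variant mentioned above could be sketched without an explicit loop as follows. This is one possible approach under stated assumptions, not a definitive implementation: it assumes all columns are numeric and the index is unique and lies on a 10-minute grid. The function name `fill_with_yearly_mean` and the frame `demo` are hypothetical.

```python
import numpy as np
import pandas as pd

def fill_with_yearly_mean(df):
    """Fill missing 10-minute rows with the mean of the same
    month/day/hour/minute slot taken over all years that have data."""
    # Put the data on a gap-free 10-minute grid; missing rows become NaN.
    full_index = pd.date_range(df.index.min(), df.index.max(), freq="10min")
    regular = df.reindex(full_index)
    # One group per calendar slot, regardless of the year.
    keys = [regular.index.month, regular.index.day,
            regular.index.hour, regular.index.minute]
    # The mean skips NaN, so only years with data contribute.
    slot_mean = regular.groupby(keys).transform("mean")
    # Existing values are kept; only NaN cells are replaced.
    return regular.fillna(slot_mean)

# Hypothetical demo: the same 50 minutes in two years, one row dropped.
idx = pd.date_range("2004-01-01 00:10", periods=5, freq="10min").append(
    pd.date_range("2005-01-01 00:10", periods=5, freq="10min"))
demo = pd.DataFrame({"T (degC)": np.arange(10, dtype=float)}, index=idx)
demo = demo.drop(pd.Timestamp("2004-01-01 00:30"))

filled = fill_with_yearly_mean(demo)
```

In this demo the row at 2004-01-01 00:30:00 receives the 2005 value for the same slot, because 2005 is the only year with data there. Slots for which no year has data stay NaN, and 29 February is averaged over leap years only.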

Looking at the values of the row with datetime index 2005-01-01 00:30:00:

print(df.loc["2005-01-01 00:30:00", :])
                     p (mbar)  T (degC)  Tpot (K)  Tdew (degC)  rh (%)
datetime                                                              
2005-01-01 00:30:00    996.36     12.67    286.13         7.11   68.82

After the row at datetime index 2004-01-01 00:30:00 is filled with the values of the row at datetime index 2005-01-01 00:30:00, the df DataFrame will contain the following row:

print(df.loc["2004-01-01 00:30:00", :])
                     p (mbar)  T (degC)  Tpot (K)  Tdew (degC)  rh (%)
datetime                                                              
2004-01-01 00:30:00    996.36     12.67    286.13         7.11   68.82

The two functions I created are shown below. The first identifies the NaN rows. The second fills them.

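import numpy as np
import pandas as pd

def findnanrows(df):
    # Boolean mask of missing cells
    is_NaN = df.isnull()
    # True for every row with at least one missing value
    row_has_NaN = is_NaN.any(axis=1)
    rows_with_NaN = df[row_has_NaN]
    return rows_with_NaN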

def filldata(weatherdata):
    fillweatherdata = weatherdata.copy()
    allyears = fillweatherdata.index.year.unique().tolist()
    # Timestamps of the 10-minute grid that have no data
    dfnan = findnanrows(fillweatherdata.groupby(pd.Grouper(freq='10T')).mean())
    for i in range(dfnan.shape[0]):
        dnan = dfnan.index[i]
        # Search forward in time for the first year, backward otherwise
        step = 1 if dnan.year == min(allyears) else -1
        # Note: Timestamp.replace raises ValueError for 29 February in a non-leap year
        dnew = dnan.replace(year=dnan.year + step)
        # Move one year at a time until the donor timestamp is not itself missing
        while dnew in dfnan.index:
            dnew = dnew.replace(year=dnew.year + step)
        # Copy the donor row and relabel it with the missing timestamp
        new_row = pd.DataFrame(np.array([fillweatherdata.loc[dnew, :]]).tolist(), columns=fillweatherdata.columns.tolist(), index=[dnan])
        # Concatenating inside the loop copies the whole frame on every iteration
        fillweatherdata = pd.concat([fillweatherdata, new_row], ignore_index=False)
    #fillweatherdata = fillweatherdata.drop_duplicates()
    fillweatherdata = fillweatherdata.sort_index()
    return fillweatherdata
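
For comparison, the donor-year idea behind filldata could be sketched without the per-row loop and without repeated pd.concat calls. This is only a sketch under assumptions, not a drop-in replacement: it assumes a unique index on a 10-minute grid, tries donors at +1, -1, +2, -2, ... years rather than following the exact search order above, and the names `fill_from_other_years`, `max_offset` and `demo` are hypothetical.

```python
import numpy as np
import pandas as pd

def fill_from_other_years(df, max_offset=3):
    """Fill all-NaN rows from the same date and time in a nearby year."""
    # Gap-free 10-minute grid; missing rows become all-NaN rows.
    full_index = pd.date_range(df.index.min(), df.index.max(), freq="10min")
    original = df.reindex(full_index)   # measured values only, used as donors
    filled = original.copy()
    for k in range(1, max_offset + 1):
        for years in (k, -k):
            # Rows that are still completely empty.
            todo = filled.index[filled.isna().all(axis=1)]
            if len(todo) == 0:
                return filled
            # Same calendar slot, shifted by whole years (29 Feb maps to 28 Feb).
            donors = todo + pd.DateOffset(years=years)
            values = original.reindex(donors)  # NaN where no donor row exists
            values.index = todo                # relabel donors with target timestamps
            filled = filled.fillna(values)
    return filled

# Hypothetical demo: the same 50 minutes in two years, one row dropped.
idx = pd.date_range("2004-01-01 00:10", periods=5, freq="10min").append(
    pd.date_range("2005-01-01 00:10", periods=5, freq="10min"))
demo = pd.DataFrame({"T (degC)": np.arange(10, dtype=float)}, index=idx)
demo = demo.drop(pd.Timestamp("2004-01-01 00:30"))

result = fill_from_other_years(demo, max_offset=1)
```

Each pass handles every remaining empty row at once, so the number of Python-level iterations depends on `max_offset` rather than on the number of missing rows. Donor values are always read from the measured data, never from rows filled in an earlier pass.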
