Forward fill (ffill) in pandas based on group and the previous row

Posted 2024-10-01 13:33:27


I have a large DataFrame (400,000+ rows) that looks like this:

import numpy as np
import pandas as pd

# dtype=object keeps the ints, strings, and NaN as they are; without it
# NumPy would convert every cell (including NaN) to a string.
data = np.array([
          [1949, '01/01/2018', np.nan, 17,     '30/11/2017'],
          [1949, '01/01/2018', np.nan, 19,      np.nan],
          [1811, '01/01/2018',     16, np.nan, '31/11/2017'],
          [1949, '01/01/2018',     15, 21,     '01/12/2017'],
          [1949, '01/01/2018', np.nan, 20,      np.nan],
          [3212, '01/01/2018',     21, 17,     '31/11/2017']
         ], dtype=object)
columns = ['id', 'ReceivedDate', 'PropertyType', 'MeterType', 'VisitDate']
df = pd.DataFrame(data, columns=columns)

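Resulting df:

     id ReceivedDate PropertyType MeterType   VisitDate
0  1949   01/01/2018          NaN        17  30/11/2017
1  1949   01/01/2018          NaN        19         NaN
2  1811   01/01/2018           16       NaN  31/11/2017
3  1949   01/01/2018           15        21  01/12/2017
4  1949   01/01/2018          NaN        20         NaN
5  3212   01/01/2018           21        17  31/11/2017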

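I want to forward fill based on a groupby (id and ReceivedDate), but only when the rows of a group sit directly next to each other in the index (i.e., only index positions 1 and 4 should be forward filled).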

I thought of adding a column that says whether a row should be filled according to these criteria, but how do I check the row above?
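
One way to check the row above is to compare each row's keys with a shifted copy of the same columns. The sketch below only illustrates that idea: the column name fill_from_above and the variables keys and same_as_above are made up for this example, and the frame is rebuilt from the question's data so the snippet runs on its own.

```python
import numpy as np
import pandas as pd

columns = ['id', 'ReceivedDate', 'PropertyType', 'MeterType', 'VisitDate']
df = pd.DataFrame([
    [1949, '01/01/2018', np.nan, 17,     '30/11/2017'],
    [1949, '01/01/2018', np.nan, 19,     np.nan],
    [1811, '01/01/2018', 16,     np.nan, '31/11/2017'],
    [1949, '01/01/2018', 15,     21,     '01/12/2017'],
    [1949, '01/01/2018', np.nan, 20,     np.nan],
    [3212, '01/01/2018', 21,     17,     '31/11/2017'],
], columns=columns)

keys = ['id', 'ReceivedDate']  # grouping columns named in the question

# True where the row directly above has the same id and ReceivedDate.
# The first row is compared against NaN, so it is always False.
same_as_above = df[keys].eq(df[keys].shift()).all(axis=1)

# Flag rows that have something missing and sit directly below a row
# of the same group (hypothetical column name).
df['fill_from_above'] = same_as_above & df[columns].isna().any(axis=1)

print(df['fill_from_above'].tolist())
```

For the sample data this flags index positions 1 and 4, the two rows the question wants filled.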

(I plan to use a solution along the lines of this answer, "pandas fill forward performance issue":

(df.isnull().astype(int)).groupby(level=0).cumsum().applymap(lambda x: None if x == 0 else 1)

because x = df.groupby(['id','ReceivedDate']).ffill() is very slow.)

Expected df:

     id     ReceivedDate    PropertyType    MeterType   VisitDate
0   1949    01/01/2018       NaN              17       30/11/2017
1   1949    01/01/2018       NaN              19       30/11/2017
2   1811    01/01/2018       16              NaN       31/11/2017
3   1949    01/01/2018       15               21       01/12/2017
4   1949    01/01/2018       15               20       01/12/2017
5   3212    01/01/2018       21               17       31/11/2017

2 Answers

groupby and ffill with limit=1

df.groupby(['id', 'ReceivedDate']).ffill(limit=1)

     id ReceivedDate PropertyType MeterType   VisitDate
0  1949   01/01/2018          NaN        17  30/11/2017
1  1949   01/01/2018          NaN        19  30/11/2017
2  1811   01/01/2018           16       NaN  31/11/2017
3  1949   01/01/2018           15        21  01/12/2017
4  1949   01/01/2018           15        20  01/12/2017
5  3212   01/01/2018           21        17  31/11/2017
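
Depending on the pandas version, the result of groupby(...).ffill() may not include the grouping columns. A minimal sketch that avoids relying on that, assuming only PropertyType and VisitDate need filling (the names keys, value_cols, and out are illustrative), fills the selected columns and writes them back into a copy:

```python
import numpy as np
import pandas as pd

columns = ['id', 'ReceivedDate', 'PropertyType', 'MeterType', 'VisitDate']
df = pd.DataFrame([
    [1949, '01/01/2018', np.nan, 17,     '30/11/2017'],
    [1949, '01/01/2018', np.nan, 19,     np.nan],
    [1811, '01/01/2018', 16,     np.nan, '31/11/2017'],
    [1949, '01/01/2018', 15,     21,     '01/12/2017'],
    [1949, '01/01/2018', np.nan, 20,     np.nan],
    [3212, '01/01/2018', 21,     17,     '31/11/2017'],
], columns=columns)

keys = ['id', 'ReceivedDate']
value_cols = ['PropertyType', 'VisitDate']  # assumption: the columns to fill

out = df.copy()
# Fill at most one consecutive missing value per group and column,
# then write the filled columns back so the key columns are kept.
out[value_cols] = df.groupby(keys)[value_cols].ffill(limit=1)

print(out)
```

Note that this fills from the previous row of the same group, which is not necessarily the row directly above in the index. In the sample data, row 3 follows row 1 within group 1949 even though row 2 sits between them.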

groupby, mask, and shift

An attempt using groupby, mask, and shift, where j holds the grouping keys:

df.mask(df.isnull().astype(int).groupby(j).cumsum().eq(1), df.groupby(j).shift())

Or

df.where(df.isnull().astype(int).groupby(j).cumsum().ne(1), df.groupby(j).shift())

     id ReceivedDate PropertyType MeterType   VisitDate
0  1949   01/01/2018          NaN        17  30/11/2017
1  1949   01/01/2018          NaN        19  30/11/2017
2  1811   01/01/2018           16       NaN  31/11/2017
3  1949   01/01/2018           15        21  01/12/2017
4  1949   01/01/2018           15        20  01/12/2017
5  3212   01/01/2018           21        17  31/11/2017
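
The definition of j does not appear in the snippet above. The following is a simplified sketch of the same mask-and-shift idea, not the answer's exact expression: it assumes the grouping keys are id and ReceivedDate, and it replaces the cumsum test with a plain "value is null" condition, so each missing value is taken from the previous row of its group. The names keys, value_cols, prev_in_group, and result are illustrative.

```python
import numpy as np
import pandas as pd

columns = ['id', 'ReceivedDate', 'PropertyType', 'MeterType', 'VisitDate']
df = pd.DataFrame([
    [1949, '01/01/2018', np.nan, 17,     '30/11/2017'],
    [1949, '01/01/2018', np.nan, 19,     np.nan],
    [1811, '01/01/2018', 16,     np.nan, '31/11/2017'],
    [1949, '01/01/2018', 15,     21,     '01/12/2017'],
    [1949, '01/01/2018', np.nan, 20,     np.nan],
    [3212, '01/01/2018', 21,     17,     '31/11/2017'],
], columns=columns)

keys = ['id', 'ReceivedDate']  # assumption: what j stands for
value_cols = [c for c in df.columns if c not in keys]

# Values of the previous row within each (id, ReceivedDate) group.
prev_in_group = df.groupby(keys)[value_cols].shift()

result = df.copy()
# Where a value is missing, take the previous group row's value instead.
result[value_cols] = df[value_cols].mask(df[value_cols].isnull(), prev_in_group)

print(result)
```

Because only one shift is applied, a missing value is filled only when the row immediately before it in the group has a value, which gives the same effect as ffill(limit=1).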
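
cols_to_ffill = ['PropertyType', 'VisitDate']
i = df.copy()  # assumes the default 0..n-1 integer index

# Dummy non-empty frame so the loop body runs at least once.
newdata = pd.DataFrame(['placeholder'])

while not newdata.index.empty:

    # id and ReceivedDate of the row directly above each row.
    RowAboveid = i.id.shift()
    RowAboveRD = i.ReceivedDate.shift()
    rows_with_cols_to_ffill_all_empty = i.loc[:, cols_to_ffill].isnull().all(axis=1)
    # Fill a row only if it is empty in every target column, belongs to the
    # same group as the row above, and the row above has something to copy
    # (without the last test, two empty rows in a row would loop forever).
    rows_to_ffill = ((i.ReceivedDate == RowAboveRD) & (i.id == RowAboveid)
                     & rows_with_cols_to_ffill_all_empty
                     & ~rows_with_cols_to_ffill_all_empty.shift(fill_value=True))
    rows_used_to_fill = i[rows_to_ffill].index - 1

    # Copy the values from the rows above, relabelled to the rows being filled.
    newdata = i.loc[rows_used_to_fill, cols_to_ffill]
    newdata.index += 1
    i.loc[rows_to_ffill, cols_to_ffill] = newdata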

This keeps looping until there are no more matches (i.e., until all the columns have been forward filled).
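
The loop can probably be avoided by labelling consecutive runs of identical keys and forward filling inside each run. This is a sketch of that alternative rather than a drop-in replacement: it fills each column independently instead of only filling rows where every listed column is empty, and the names keys, run_id, and out are made up for the example.

```python
import numpy as np
import pandas as pd

columns = ['id', 'ReceivedDate', 'PropertyType', 'MeterType', 'VisitDate']
df = pd.DataFrame([
    [1949, '01/01/2018', np.nan, 17,     '30/11/2017'],
    [1949, '01/01/2018', np.nan, 19,     np.nan],
    [1811, '01/01/2018', 16,     np.nan, '31/11/2017'],
    [1949, '01/01/2018', 15,     21,     '01/12/2017'],
    [1949, '01/01/2018', np.nan, 20,     np.nan],
    [3212, '01/01/2018', 21,     17,     '31/11/2017'],
], columns=columns)

keys = ['id', 'ReceivedDate']
cols_to_ffill = ['PropertyType', 'VisitDate']

# A new run starts whenever id or ReceivedDate differs from the row above;
# the cumulative sum turns those start markers into one label per run.
run_id = df[keys].ne(df[keys].shift()).any(axis=1).cumsum()

out = df.copy()
# Forward fill only within each run of adjacent rows sharing the same keys.
out[cols_to_ffill] = df.groupby(run_id)[cols_to_ffill].ffill()

print(out)
```

On the sample data the runs are rows 0 to 1, row 2, rows 3 to 4, and row 5, so only index positions 1 and 4 receive values from the row above them.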
