How do I handle missing values in sequences of a specific length in a pandas DataFrame?

                 A       B
2017-01-01 -0.0053 -0.0062
2017-01-02     NaN  0.0016
2017-01-03     NaN  0.0043
2017-01-04     NaN -0.0077
2017-01-05     NaN -0.0070
2017-01-06     NaN  0.0058
2017-01-07  0.0024 -0.0074
2017-01-08  0.0018  0.0086
2017-01-09  0.0020  0.0012
2017-01-10 -0.0031 -0.0020
2017-01-11  0.0027     NaN
2017-01-12 -0.0050     NaN
2017-01-13 -0.0063     NaN
2017-01-14  0.0066  0.0095
2017-01-15  0.0039  0.0028

# imports
import pandas as pd
import numpy as np
np.random.seed(1234)

# Reproducible data sample
def df_sample(rows, names):
    ''' Function to create data sample with random returns

    Parameters
    ==========
    rows : number of rows in the dataframe
    names: list of names to represent assets

    Example
    =======

    >>> df_sample(rows = 2, names = ['A', 'B'])

                  A       B
    2017-01-01  0.0027  0.0075
    2017-01-02 -0.0050 -0.0024
    '''
    listVars = names
    rng = pd.date_range('1/1/2017', periods=rows, freq='D')
    df_temp = pd.DataFrame(np.random.randint(-100,100,size=(rows, len(listVars))), columns=listVars)
    df_temp = df_temp.set_index(rng)
    df_temp = df_temp / 10000
    return df_temp

df = df_sample(15, list('AB'))

                 A       B
2017-01-01 -0.0053 -0.0062
2017-01-02     NaN  0.0016
2017-01-03     NaN  0.0043
2017-01-04     NaN     NaN
2017-01-05     NaN     NaN
2017-01-06     NaN     NaN
2017-01-07  0.0024     NaN
2017-01-08  0.0018     NaN
2017-01-09  0.0020  0.0012
2017-01-10     NaN -0.0020

import pandas as pd
import numpy as np
np.random.seed(1234)

# Reproducible data sample
def df_sample(rows, names):
    ''' Function to create data sample with random returns

    Parameters
    ==========
    rows : number of rows in the dataframe
    names: list of names to represent assets

    Example
    =======

    >>> df_sample(rows = 2, names = ['A', 'B'])

                  A       B
    2017-01-01  0.0027  0.0075
    2017-01-02 -0.0050 -0.0024
    '''
    listVars = names
    rng = pd.date_range('1/1/2017', periods=rows, freq='D')
    df_temp = pd.DataFrame(np.random.randint(-100,100,size=(rows, len(listVars))), columns=listVars)
    df_temp = df_temp.set_index(rng)
    df_temp = df_temp / 10000
    return df_temp

df = df_sample(15, list('AB'))

# insert runs of missing values (.iloc avoids chained assignment)
df.iloc[1:6, df.columns.get_loc('A')] = np.nan
df.iloc[3:8, df.columns.get_loc('B')] = np.nan
dfi = df.copy()  # keep a copy of the original values

# convert to boolean values
df = dfi.isnull()

# specify pattern
pattern = [True, True, True, True, True]

# prepare for a for loop
idx = []

# loop through all columns and identify sequence of missing values
for col in df:
    df_temp = df[col].to_frame()
    # a window matches when all of its values equal the pattern
    matched = df_temp.rolling(len(pattern)).apply(lambda x: all(np.equal(x, pattern)))
    matched = matched.sum(axis = 1).astype(bool)
    # positions of the last row of each matching window
    idx_matched = np.where(matched)[0]
    subset = [range(match-len(pattern)+1, match+1) for match in idx_matched]
    result = pd.concat([df.iloc[subs,:] for subs in subset], axis = 0).index
    idx.append(result)

print(idx)

[DatetimeIndex(['2017-01-02', '2017-01-03', '2017-01-04', '2017-01-05',
                '2017-01-06'],
               dtype='datetime64[ns]', freq=None),
 DatetimeIndex(['2017-01-04', '2017-01-05', '2017-01-06', '2017-01-07',
                '2017-01-08'],
               dtype='datetime64[ns]', freq=None)]

1 answer

User

#1 · Posted 2024-09-28 03:15:13

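This should solve your problem. It does not drop any rows until the very end, so it correctly handles multiple columns, as your second scenario requires. I used the df from your complements section to produce the output of the code below.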

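Explanation:

We create another df in which NaN values are mapped to 0 and every finite value is mapped to 1 (if your initial df contains zeros, you first need to map them to a nonzero dummy value in df2, and then apply .fillna(0).astype('bool')).
Grouping by the cumulative sum of each column locates the runs of consecutive NaN values. Comparing against the original df then ensures that we do not capture the first non-null value.
The mask of rows to drop is built at the end, so multiple columns with overlapping NaN values are handled correctly.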
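
To make the grouping step concrete, here is a minimal sketch on a single toy column. The Series `s` and the names `group_id`, `group_size` and `in_long_run` are made up for illustration and are not part of the original post. The sketch also uses `notna()` in place of `fillna(0).astype('bool')`, which is a substitution on my part: it produces the same 0/1 markers without needing the special handling for zeros.

```python
import numpy as np
import pandas as pd

# Hypothetical toy column: one run of three NaNs and one isolated NaN
s = pd.Series([0.5, np.nan, np.nan, np.nan, 0.2, np.nan, 0.7])

# Non-null values count as 1 and NaN as 0, so the cumulative sum
# stays constant across every run of NaNs
group_id = s.notna().cumsum()

# Each group is one non-null value plus the NaNs that follow it
group_size = group_id.groupby(group_id).transform("size")

n_cons = 3  # minimum number of consecutive NaNs to flag

# Size > n_cons means at least n_cons NaNs; the isna() check drops
# the non-null value that opens the group
in_long_run = (group_size > n_cons) & s.isna()

print(group_id.tolist())     # [1, 1, 1, 1, 2, 2, 3]
print(in_long_run.tolist())  # [False, True, True, True, False, False, False]
```

The isolated NaN at position 5 forms a group of size 2 with the value before it, so it is not flagged.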

The code:

import pandas as pd
import numpy as np

## If the initial df contains values of 0 do this instead of the first line below
#df2 = df.copy()
#df2[df2==0] = 0.01
#df2 = df2.fillna(0).astype('bool').cumsum()

# Min number of consecutive NaN values to begin dropping
n_cons = 5

df2 = df.fillna(0).astype('bool').cumsum()
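for col in df2.columns:
    # Each group is one non-null value plus the NaN run that follows it,
    # so a size greater than n_cons means at least n_cons consecutive NaNs
    df2[col] = df2.groupby(col)[col].transform(lambda x: np.size(x) > n_cons)
    # Keep only the NaN positions, not the non-null value that opens the group
    df2[col] = df2[col] & df[col].isnull()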

mask = df2.any(axis=1)

df[~mask]
#                 A       B
#2017-01-01 -0.0053 -0.0062
#2017-01-09  0.0020  0.0012
#2017-01-10     NaN -0.0020
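
One possible variation, offered as a sketch rather than as part of the answer above: because each cumsum group normally includes the non-null value that precedes the NaN run, a run sitting at the very top of a column has no such value and its group is one element shorter. Numbering runs by changes in the NaN state avoids that edge case. The function name `drop_nan_runs` and the frame `df_demo` (which copies the values from the question's second table) are hypothetical.

```python
import numpy as np
import pandas as pd

def drop_nan_runs(df, n_cons):
    """Drop rows lying inside a run of >= n_cons consecutive NaNs in any column."""
    isna = df.isna()
    mask = pd.Series(False, index=df.index)
    for col in df.columns:
        # A new run starts whenever the NaN / non-NaN state changes
        run_id = (isna[col] != isna[col].shift()).cumsum()
        # Length of the run each row belongs to
        run_len = isna[col].groupby(run_id).transform("size")
        # Flag rows that are NaN and sit in a sufficiently long run
        mask |= isna[col] & (run_len >= n_cons)
    return df[~mask]

# Data from the second table in the question
nan = np.nan
df_demo = pd.DataFrame(
    {
        "A": [-0.0053, nan, nan, nan, nan, nan, 0.0024, 0.0018, 0.0020, nan],
        "B": [-0.0062, 0.0016, 0.0043, nan, nan, nan, nan, nan, 0.0012, -0.0020],
    },
    index=pd.date_range("2017-01-01", periods=10, freq="D"),
)

result = drop_nan_runs(df_demo, n_cons=5)
print(result)
```

On this data the function keeps the same three rows as the answer's code: 2017-01-01, 2017-01-09 and 2017-01-10.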
