A faster way to fill NaN values with the mean

import numpy as np
import pandas as pd

d = {'B19325_038E': {409606: 9.0, 403811: 53.0, 400166: 17.0, 402573: 105.0, 400130: 43.0, 404907: 21.0, 406751: 15.0, 403850: 39.0, 404089: 81.0, 409843: np.nan}, 'B08302_014E': {409606: 2.0, 403811: 156.0, 400166: 64.0, 402573: 211.0, 400130: 140.0, 404907: 90.0, 406751: 148.0, 403850: 71.0, 404089: 341.0, 409843: 91.0}, 'B17010I_026E': {409606: np.nan, 403811: 9.0, 400166: np.nan, 402573: np.nan, 400130: np.nan, 404907: np.nan, 406751: np.nan, 403850: np.nan, 404089: np.nan, 409843: 21.0}, 'B17015_009E': {409606: 30.0, 403811: 18.0, 400166: 12.0, 402573: 5.0, 400130: 6.0, 404907: 11.0, 406751: 23.0, 403850: 49.0, 404089: 37.0, 409843: 60.0}, 'B06003_004E': {409606: 1552.0, 403811: 3562.0, 400166: 2536.0, 402573: 4911.0, 400130: 1913.0, 404907: 1888.0, 406751: 4264.0, 403850: 2087.0, 404089: 1443.0, 409843: 867.0}, 'B15001_038E': {409606: 46.0, 403811: 104.0, 400166: 89.0, 402573: 120.0, 400130: 61.0, 404907: 14.0, 406751: 60.0, 403850: 198.0, 404089: 97.0, 409843: 25.0}, 'B08130_006E': {409606: 280.0, 403811: 2325.0, 400166: 1381.0, 402573: 2907.0, 400130: 1300.0, 404907: 1528.0, 406751: 2502.0, 403850: 1278.0, 404089: 1986.0, 409843: 308.0}, 'B19201_002E': {409606: 80.0, 403811: 75.0, 400166: 24.0, 402573: 54.0, 400130: np.nan, 404907: np.nan, 406751: 43.0, 403850: 62.0, 404089: 32.0, 409843: 33.0}, 'B19325_087E': {409606: 35.0, 403811: 29.0, 400166: 33.0, 402573: 72.0, 400130: 20.0, 404907: np.nan, 406751: 39.0, 403850: 40.0, 404089: 40.0, 409843: 5.0}, 'B06003_008E': {409606: 106.0, 403811: 458.0, 400166: 296.0, 402573: 505.0, 400130: 277.0, 404907: 804.0, 406751: 1037.0, 403850: 726.0, 404089: 1854.0, 409843: 80.0}, 'B16006_003E': {409606: 30.0, 403811: 525.0, 400166: 160.0, 402573: 33.0, 400130: 386.0, 404907: 2.0, 406751: 55.0, 403850: 121.0, 404089: 686.0, 409843: 228.0}, 'C14007A_004E': {409606: np.nan, 403811: np.nan, 400166: np.nan, 402573: np.nan, 400130: np.nan, 404907: np.nan, 406751: np.nan, 403850: np.nan, 404089: np.nan, 409843: np.nan}, 
'C14007A_005E': {409606: np.nan, 403811: np.nan, 400166: np.nan, 402573: np.nan, 400130: np.nan, 404907: np.nan, 406751: np.nan, 403850: np.nan, 404089: np.nan, 409843: np.nan}, 'C14007A_003E': {409606: np.nan, 403811: np.nan, 400166: np.nan, 402573: np.nan, 400130: np.nan, 404907: np.nan, 406751: np.nan, 403850: np.nan, 404089: np.nan, 409843: np.nan}, 'C21001I_003E': {409606: 31.0, 403811: 287.0, 400166: 86.0, 402573: 25.0, 400130: 235.0, 404907: 35.0, 406751: 32.0, 403850: 73.0, 404089: 384.0, 409843: 84.0}, 'C21001I_006E': {409606: np.nan, 403811: 35.0, 400166: np.nan, 402573: np.nan, 400130: np.nan, 404907: np.nan, 406751: 13.0, 403850: 17.0, 404089: 19.0, 409843: 6.0}}
df = pd.DataFrame(data=d)

from joblib import Parallel, delayed

def fill_nan(df, col):
    df[col].fillna(df[col].mean(), inplace=True)
    return df

col_list = all_data.columns.tolist()
l = Parallel(n_jobs=-1)(delayed(fill_nan)(df=all_data, col=cols) for cols in col_list)

TerminatedWorkerError: A worker process managed by the executor was unexpectedly terminated. This could be caused by a segmentation fault while calling the function or by an excessive memory usage causing the Operating System to kill the worker. The exit codes of the workers are {SIGSEGV(-11)}

2 Answers

User

Answer #1 · edited 2024-09-30 05:27:18

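As a rule of thumb, calling .fillna() on every column is more expensive than applying it only to the columns that actually contain NaN. Compare the following two functions: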

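def fill_nan1(df):
    # Calls fillna on every column, whether or not it contains NaN.
    col_list = df.columns.tolist()
    for col in col_list:
        df[col].fillna(df[col].mean(),inplace=True)
    return df

def fill_nan2(df):
    # Calls fillna only on the columns that contain at least one NaN.
    for col in df.columns[df.isnull().any(axis=0)]:
        df[col].fillna(df[col].mean(),inplace=True)
    return df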

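In fill_nan1(), .fillna() is applied to every column of df (which is what your code does), whereas in fill_nan2() it is applied only to the columns that contain NaN. Timing both with %timeit gives the following results: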

>>%timeit fill_nan1(df)
2.35 ms ± 25 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

>>%timeit fill_nan2(df)
938 µs ± 8.06 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
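
Both functions above call fillna(..., inplace=True) on a column selection. Newer pandas releases warn about that pattern, and it may leave df unchanged once Copy-on-Write is enabled. Here is a minimal sketch of the same idea using plain assignment. The names fill_nan_assign and fill_nan_vectorized and the small demo frame are hypothetical, and no timings are claimed for them.

```python
import numpy as np
import pandas as pd

def fill_nan_assign(df):
    """Fill NaN with the column mean, touching only columns that contain NaN."""
    out = df.copy()
    nan_cols = out.columns[out.isna().any(axis=0)]
    # Assign the filled columns back instead of relying on inplace=True.
    out[nan_cols] = out[nan_cols].fillna(out[nan_cols].mean())
    return out

def fill_nan_vectorized(df):
    """Fill NaN in every column with that column's mean in a single call."""
    return df.fillna(df.mean())

# Hypothetical demo data: one partly missing, one complete, one all-NaN column.
demo = pd.DataFrame({
    "a": [1.0, np.nan, 3.0],
    "b": [4.0, 5.0, 6.0],
    "c": [np.nan, np.nan, np.nan],
})
filled = fill_nan_assign(demo)
```

A column that is entirely NaN (such as the C14007A_* columns in the question) has no mean, so it stays NaN with either function.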

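Also, if this is for ML purposes, split the data into training and test sets before filling the NaN values. This not only speeds up the computation, it also avoids incorrect imputation, because the fill values are then not influenced by the test set.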
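
The split-then-impute advice can be sketched as follows, assuming the means are taken from the training rows only and then reused for the test rows. The function name impute_with_train_means, the toy data, and the positional split are all hypothetical.

```python
import numpy as np
import pandas as pd

def impute_with_train_means(train, test):
    """Fill NaN in both frames using column means computed from `train` only."""
    train_means = train.mean()
    return train.fillna(train_means), test.fillna(train_means)

# Hypothetical toy data.
data = pd.DataFrame({
    "x": [1.0, np.nan, 3.0, 100.0],
    "y": [10.0, 20.0, np.nan, np.nan],
})
# Hypothetical split: first three rows for training, last row for testing.
train, test = data.iloc[:3], data.iloc[3:]
train_filled, test_filled = impute_with_train_means(train, test)
```

Here the missing y in the test row is filled with the training mean of y, so the test row's own values never affect the imputation.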

User

Answer #2 · edited 2024-09-30 05:27:18

Could you try this NumPy version?

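x = df.values                      # NumPy array holding the frame's data
avg = np.nanmean(x, axis=0)        # per-column mean, ignoring NaN
idx = np.nonzero(np.isnan(x))      # (row indices, column indices) of the NaN cells
x[idx] = np.take(avg, idx[1])      # write each column's mean into its NaN cells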

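The NaN values in the DataFrame are updated automatically, because for a single-dtype frame like this one x = df.values can be a view of the data rather than a copy. This is not guaranteed: a frame with mixed dtypes yields a copy, and with Copy-on-Write enabled pandas returns a read-only array, so check df afterwards and assign the result back if it is unchanged.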
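
For cases where the view behaviour cannot be relied on, here is a hedged variant of the NumPy approach that works on an explicit copy and returns a new DataFrame. The function name fill_nan_numpy and the demo frame are hypothetical, and it assumes every column is numeric.

```python
import numpy as np
import pandas as pd

def fill_nan_numpy(df):
    """Return a new DataFrame with NaN replaced by the column mean (NumPy route)."""
    x = df.to_numpy(dtype=float, copy=True)   # explicit, writable copy
    has_value = ~np.isnan(x).all(axis=0)      # skip all-NaN columns (no mean exists)
    avg = np.full(x.shape[1], np.nan)
    avg[has_value] = np.nanmean(x[:, has_value], axis=0)
    rows, cols = np.nonzero(np.isnan(x))      # positions of the NaN cells
    x[rows, cols] = avg[cols]                 # fill each cell with its column mean
    return pd.DataFrame(x, index=df.index, columns=df.columns)

# Hypothetical demo data.
demo = pd.DataFrame({
    "a": [1.0, np.nan, 3.0],
    "b": [4.0, 5.0, 6.0],
    "c": [np.nan, np.nan, np.nan],
})
out = fill_nan_numpy(demo)
```

Skipping the all-NaN columns before calling np.nanmean avoids the "Mean of empty slice" warning that the question's C14007A_* columns would otherwise trigger. Those columns remain NaN.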
