A faster way to fill NaN values with the mean

Posted 2024-09-30 05:27:18


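I closed my previous question, so I am reposting it with more context. I am running this on a relatively large (59 GB) dataset with shape (800,000, 10,500). Running df.fillna(df.mean()) on an AWS EC2 instance took an extremely long time; after 4 hours I cancelled the cell. Is there a faster way to compute the mean of each column and fill that column's NaN values?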

Here is a sample of the data:

import numpy as np
import pandas as pd

d = {'B19325_038E': {409606: 9.0, 403811: 53.0, 400166: 17.0, 402573: 105.0, 400130: 43.0, 404907: 21.0, 406751: 15.0, 403850: 39.0, 404089: 81.0, 409843: np.nan}, 'B08302_014E': {409606: 2.0, 403811: 156.0, 400166: 64.0, 402573: 211.0, 400130: 140.0, 404907: 90.0, 406751: 148.0, 403850: 71.0, 404089: 341.0, 409843: 91.0}, 'B17010I_026E': {409606: np.nan, 403811: 9.0, 400166: np.nan, 402573: np.nan, 400130: np.nan, 404907: np.nan, 406751: np.nan, 403850: np.nan, 404089: np.nan, 409843: 21.0}, 'B17015_009E': {409606: 30.0, 403811: 18.0, 400166: 12.0, 402573: 5.0, 400130: 6.0, 404907: 11.0, 406751: 23.0, 403850: 49.0, 404089: 37.0, 409843: 60.0}, 'B06003_004E': {409606: 1552.0, 403811: 3562.0, 400166: 2536.0, 402573: 4911.0, 400130: 1913.0, 404907: 1888.0, 406751: 4264.0, 403850: 2087.0, 404089: 1443.0, 409843: 867.0}, 'B15001_038E': {409606: 46.0, 403811: 104.0, 400166: 89.0, 402573: 120.0, 400130: 61.0, 404907: 14.0, 406751: 60.0, 403850: 198.0, 404089: 97.0, 409843: 25.0}, 'B08130_006E': {409606: 280.0, 403811: 2325.0, 400166: 1381.0, 402573: 2907.0, 400130: 1300.0, 404907: 1528.0, 406751: 2502.0, 403850: 1278.0, 404089: 1986.0, 409843: 308.0}, 'B19201_002E': {409606: 80.0, 403811: 75.0, 400166: 24.0, 402573: 54.0, 400130: np.nan, 404907: np.nan, 406751: 43.0, 403850: 62.0, 404089: 32.0, 409843: 33.0}, 'B19325_087E': {409606: 35.0, 403811: 29.0, 400166: 33.0, 402573: 72.0, 400130: 20.0, 404907: np.nan, 406751: 39.0, 403850: 40.0, 404089: 40.0, 409843: 5.0}, 'B06003_008E': {409606: 106.0, 403811: 458.0, 400166: 296.0, 402573: 505.0, 400130: 277.0, 404907: 804.0, 406751: 1037.0, 403850: 726.0, 404089: 1854.0, 409843: 80.0}, 'B16006_003E': {409606: 30.0, 403811: 525.0, 400166: 160.0, 402573: 33.0, 400130: 386.0, 404907: 2.0, 406751: 55.0, 403850: 121.0, 404089: 686.0, 409843: 228.0}, 'C14007A_004E': {409606: np.nan, 403811: np.nan, 400166: np.nan, 402573: np.nan, 400130: np.nan, 404907: np.nan, 406751: np.nan, 403850: np.nan, 404089: np.nan, 409843: np.nan}, 
'C14007A_005E': {409606: np.nan, 403811: np.nan, 400166: np.nan, 402573: np.nan, 400130: np.nan, 404907: np.nan, 406751: np.nan, 403850: np.nan, 404089: np.nan, 409843: np.nan}, 'C14007A_003E': {409606: np.nan, 403811: np.nan, 400166: np.nan, 402573: np.nan, 400130: np.nan, 404907: np.nan, 406751: np.nan, 403850: np.nan, 404089: np.nan, 409843: np.nan}, 'C21001I_003E': {409606: 31.0, 403811: 287.0, 400166: 86.0, 402573: 25.0, 400130: 235.0, 404907: 35.0, 406751: 32.0, 403850: 73.0, 404089: 384.0, 409843: 84.0}, 'C21001I_006E': {409606: np.nan, 403811: 35.0, 400166: np.nan, 402573: np.nan, 400130: np.nan, 404907: np.nan, 406751: 13.0, 403850: 17.0, 404089: 19.0, 409843: 6.0}}

df = pd.DataFrame(data=d)

Here is an htop screenshot of my machine showing its state while df.fillna(df.mean()) was running:

[image: htop screenshot]

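As you can see, it appears to be working, but I see no fluctuation in memory at all, so it may be frozen. It is hard to tell, and letting it run for more than 4 hours is a waste of money.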

Is there a way to parallelize df.fillna(df.mean()) so that it runs faster?

To give more context, here is what I am currently trying, since so far nobody seems to know:

from joblib import Parallel, delayed

def fill_nan(df, col):
    df[col].fillna(df[col].mean(),inplace=True)
    return df

col_list=all_data.columns.tolist()
l = Parallel(n_jobs=-1)(delayed(fill_nan)(df=all_data,col=cols) for cols in col_list)

The problem is that I get this error:

TerminatedWorkerError: A worker process managed by the executor was unexpectedly terminated. This could be caused by a segmentation fault while calling the function or by an excessive memory usage causing the Operating System to kill the worker.

The exit codes of the workers are {SIGSEGV(-11)}

Setting the error aside, would this approach actually make the computation faster?


2 Answers

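As a rule of thumb, calling .fillna() on every column is more expensive than applying it only to the columns that contain NaN. Compare the results of the following two functions: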

def fill_nan1(df):
    col_list = df.columns.tolist()
    for col in col_list:
        df[col].fillna(df[col].mean(),inplace=True)
    return df

def fill_nan2(df):
    for col in df.columns[df.isnull().any(axis=0)]:
        df[col].fillna(df[col].mean(),inplace=True)
    return df

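In fill_nan1(), .fillna() is applied to every column of df (which is what happens in your case), whereas in fill_nan2() it is applied only to the columns that contain NaN. Running timeit on both functions with the sample data gives the following results: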

>>%timeit fill_nan1(df)
2.35 ms ± 25 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

>>%timeit fill_nan2(df)
938 µs ± 8.06 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
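
The same idea can also be written without the chained inplace=True call. What follows is a minimal sketch rather than the benchmarked code above: the helper name fill_nan_selective and the toy frame are assumptions made for illustration, and no timing is claimed for it.

```python
import numpy as np
import pandas as pd


def fill_nan_selective(df: pd.DataFrame) -> pd.DataFrame:
    """Fill NaN with the column mean, touching only columns that contain NaN.

    Hypothetical helper for illustration; it modifies df and returns it.
    """
    # Labels of the columns that have at least one missing value.
    nan_cols = df.columns[df.isna().any(axis=0)]
    # Compute the means for just those columns in one vectorised call.
    means = df[nan_cols].mean()
    for col in nan_cols:
        # Plain assignment instead of a chained fillna(..., inplace=True).
        df[col] = df[col].fillna(means[col])
    return df


# Toy data (assumed): only column "a" contains a NaN.
small = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [4.0, 5.0, 6.0]})
filled = fill_nan_selective(small)
```

Columns without NaN are never touched, so the loop cost scales with the number of incomplete columns rather than with the full width of the frame.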

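Also, if this is for ML purposes, split the data into training and test sets before filling the NaN values. This not only speeds up the computation, it also avoids incorrect imputation, because the fill values are then not computed from the test rows.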
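
Splitting before imputing could look like the sketch below. The toy column x, the positional 75/25 split, and the variable names are all assumptions for illustration; the point shown is only that the means are taken from the training rows and then reused for the test rows.

```python
import numpy as np
import pandas as pd

# Toy data (assumed); the 75/25 split ratio is an arbitrary choice.
data = pd.DataFrame({"x": [1.0, np.nan, 3.0, 5.0, np.nan, 7.0, 9.0, np.nan]})

# Split first. A simple positional split is used here for brevity.
n_train = int(len(data) * 0.75)
train = data.iloc[:n_train].copy()
test = data.iloc[n_train:].copy()

# The column means are computed from the training rows only.
train_means = train.mean()

# The same training means fill the gaps in both parts.
train = train.fillna(train_means)
test = test.fillna(train_means)
```

Because the means are computed on the smaller training frame, this step also does less work than computing them over the whole dataset.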

Can you try this version using numpy?

x = df.values
avg = np.nanmean(x, axis=0)
idx = np.nonzero(np.isnan(x))
x[idx] = np.take(avg, idx[1])

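The DataFrame's NaN values are updated automatically when x = df.values is a view rather than a copy of the data. That is only the case when all columns share a single dtype (and it may depend on the pandas version), so check the DataFrame afterwards.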
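
If relying on a view is not an option, a variant that works on an explicit copy and builds a new DataFrame might look like this. It is a sketch under stated assumptions: the helper name fill_nan_numpy and its chunk parameter are hypothetical, every column is assumed to be numeric, and the copy costs as much memory as the data itself, which matters at 59 GB.

```python
import numpy as np
import pandas as pd


def fill_nan_numpy(df: pd.DataFrame, chunk: int = 1000) -> pd.DataFrame:
    """Return a new frame with NaN replaced by the column means.

    Hypothetical helper; assumes all columns are numeric. Columns are
    processed in blocks of `chunk` to bound the size of the temporaries.
    """
    # Explicit float64 copy, so the result never depends on view semantics.
    x = df.to_numpy(dtype="float64", copy=True)
    for start in range(0, x.shape[1], chunk):
        block = x[:, start:start + chunk]  # basic slice: a view into x
        avg = np.nanmean(block, axis=0)    # per-column mean, ignoring NaN
        rows, cols = np.nonzero(np.isnan(block))
        block[rows, cols] = avg[cols]      # writes through to x
    return pd.DataFrame(x, index=df.index, columns=df.columns)


# Toy data (assumed); chunk=1 just exercises the block loop.
toy = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [np.nan, 2.0, 4.0]})
result = fill_nan_numpy(toy, chunk=1)
```

A column that is entirely NaN has no mean, so it stays NaN (numpy also emits a RuntimeWarning for it); such columns need a separate decision, for example dropping them.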
