pyarrow 1.0 bug: ParquetDataset throws an out-of-memory error when reading a large number of files (works fine with 0.13)

Posted 2024-09-28 01:27:41


I have a dataframe split and stored across 5000+ files, and I load them all with ParquetDataset(fnames).read(). After upgrading pyarrow from 0.13.0 to the latest 1.0.1, the same call started failing with "OSError: Out of memory: malloc of size 131072 failed". The same code on the same machine still works with the old version. The machine has 256 GB of RAM, far more than the <10 GB needed to load the data. You can reproduce the problem on your side with the code below.

    # create a big dataframe
    import pandas as pd
    import numpy as np

    df = pd.DataFrame({'A': np.arange(50000000)})
    df['F1'] = np.random.randn(50000000) * 100
    df['F2'] = np.random.randn(50000000) * 100
    df['F3'] = np.random.randn(50000000) * 100
    df['F4'] = np.random.randn(50000000) * 100
    df['F5'] = np.random.randn(50000000) * 100
    df['F6'] = np.random.randn(50000000) * 100
    df['F7'] = np.random.randn(50000000) * 100
    df['F8'] = np.random.randn(50000000) * 100
    df['F9'] = 'ABCDEFGH'
    df['F10'] = 'ABCDEFGH'
    df['F11'] = 'ABCDEFGH'
    df['F12'] = 'ABCDEFGH01234'
    df['F13'] = 'ABCDEFGH01234'
    df['F14'] = 'ABCDEFGH01234'
    df['F15'] = 'ABCDEFGH01234567'
    df['F16'] = 'ABCDEFGH01234567'
    df['F17'] = 'ABCDEFGH01234567'

    # split and save data to 5000 files
    for i in range(5000):
        df.iloc[i*10000:(i+1)*10000].to_parquet(f'{i}.parquet', index=False)

    # use a fresh session to read data

    # the code below reads all files successfully via pandas
    import pandas as pd
    df = []
    for i in range(5000):
        df.append(pd.read_parquet(f'{i}.parquet'))

    df = pd.concat(df)


    # below code crashes with memory error in pyarrow 1.0/1.0.1 (works fine with version 0.13.0)
    # tried use_legacy_dataset=False, same issue
    import pyarrow.parquet as pq

    fnames = []
    for i in range(5000):
        fnames.append(f'{i}.parquet')

    len(fnames)

    df = pq.ParquetDataset(fnames).read(use_threads=False)
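
A possible workaround, not part of the original post: read the files one at a time with pq.read_table and concatenate the resulting Arrow tables, instead of handing all 5000 paths to ParquetDataset at once. A minimal sketch, assuming the same 0.parquet ... 4999.parquet files written above:

    # workaround sketch (not from the original post): read each file
    # individually and concatenate the Arrow tables, avoiding a single
    # ParquetDataset built over all 5000 paths
    import pyarrow as pa
    import pyarrow.parquet as pq

    tables = [pq.read_table(f'{i}.parquet', use_threads=False) for i in range(5000)]
    df = pa.concat_tables(tables).to_pandas()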

