pyarrow 1.0 bug: ParquetDataset throws an out-of-memory error when reading a large number of files (works fine with 0.13)

Posted 2024-09-28 01:27:41


I have a dataframe split and stored across 5000+ files, and I load them all with ParquetDataset(fnames).read(). After upgrading pyarrow from 0.13.0 to the latest 1.0.1, the same call started failing with "OSError: Out of memory: malloc of size 131072 failed". The same code on the same machine still works with the old version. The machine has 256 GB of RAM, far more than the <10 GB needed to load the data. You can reproduce the problem on your side with the code below.

    # create a big dataframe
    import pandas as pd
    import numpy as np

    df = pd.DataFrame({'A': np.arange(50000000)})
    df['F1'] = np.random.randn(50000000) * 100
    df['F2'] = np.random.randn(50000000) * 100
    df['F3'] = np.random.randn(50000000) * 100
    df['F4'] = np.random.randn(50000000) * 100
    df['F5'] = np.random.randn(50000000) * 100
    df['F6'] = np.random.randn(50000000) * 100
    df['F7'] = np.random.randn(50000000) * 100
    df['F8'] = np.random.randn(50000000) * 100
    df['F9'] = 'ABCDEFGH'
    df['F10'] = 'ABCDEFGH'
    df['F11'] = 'ABCDEFGH'
    df['F12'] = 'ABCDEFGH01234'
    df['F13'] = 'ABCDEFGH01234'
    df['F14'] = 'ABCDEFGH01234'
    df['F15'] = 'ABCDEFGH01234567'
    df['F16'] = 'ABCDEFGH01234567'
    df['F17'] = 'ABCDEFGH01234567'

    # split and save data to 5000 files
    for i in range(5000):
        df.iloc[i*10000:(i+1)*10000].to_parquet(f'{i}.parquet', index=False)

    # use a fresh session to read data

    # the code below reads all files successfully via pandas
    import pandas as pd
    df = []
    for i in range(5000):
        df.append(pd.read_parquet(f'{i}.parquet'))

    df = pd.concat(df)


    # below code crashes with memory error in pyarrow 1.0/1.0.1 (works fine with version 0.13.0)
    # tried use_legacy_dataset=False, same issue
    import pyarrow.parquet as pq

    fnames = []
    for i in range(5000):
        fnames.append(f'{i}.parquet')

    len(fnames)

    df = pq.ParquetDataset(fnames).read(use_threads=False)
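
A possible workaround, not part of the original post: read the files one at a time with pq.read_table and concatenate the resulting Arrow tables, instead of handing all 5000 paths to ParquetDataset at once. A minimal sketch, assuming the same 0.parquet ... 4999.parquet files written above:

    # workaround sketch (not from the original post): read each file
    # individually and concatenate the Arrow tables, avoiding a single
    # ParquetDataset built over all 5000 paths
    import pyarrow as pa
    import pyarrow.parquet as pq

    tables = [pq.read_table(f'{i}.parquet', use_threads=False) for i in range(5000)]
    df = pa.concat_tables(tables).to_pandas()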

