I have a dataframe split across more than 5000 files, which I load with ParquetDataset(fnames).read(). After updating pyarrow from 0.13.0 to the latest version 1.0.1, it started throwing "OSError: Out of memory: malloc of size 131072 failed". The same code on the same machine still works with the old version. My machine has 256 GB of RAM, far more than the <10 GB the data needs to load. You can reproduce the problem on your side with the code below.
# create a big dataframe
import pandas as pd
import numpy as np
df = pd.DataFrame({'A': np.arange(50000000)})
df['F1'] = np.random.randn(50000000) * 100
df['F2'] = np.random.randn(50000000) * 100
df['F3'] = np.random.randn(50000000) * 100
df['F4'] = np.random.randn(50000000) * 100
df['F5'] = np.random.randn(50000000) * 100
df['F6'] = np.random.randn(50000000) * 100
df['F7'] = np.random.randn(50000000) * 100
df['F8'] = np.random.randn(50000000) * 100
df['F9'] = 'ABCDEFGH'
df['F10'] = 'ABCDEFGH'
df['F11'] = 'ABCDEFGH'
df['F12'] = 'ABCDEFGH01234'
df['F13'] = 'ABCDEFGH01234'
df['F14'] = 'ABCDEFGH01234'
df['F15'] = 'ABCDEFGH01234567'
df['F16'] = 'ABCDEFGH01234567'
df['F17'] = 'ABCDEFGH01234567'
# split and save data to 5000 files
for i in range(5000):
    df.iloc[i*10000:(i+1)*10000].to_parquet(f'{i}.parquet', index=False)
# use a fresh session to read data
# below code works to read
import pandas as pd
df = []
for i in range(5000):
    df.append(pd.read_parquet(f'{i}.parquet'))
df = pd.concat(df)
# below code crashes with memory error in pyarrow 1.0/1.0.1 (works fine with version 0.13.0)
# tried use_legacy_dataset=False, same issue
import pyarrow.parquet as pq
fnames = []
for i in range(5000):
    fnames.append(f'{i}.parquet')
len(fnames)
df = pq.ParquetDataset(fnames).read(use_threads=False)
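# For comparison, a pyarrow-only variant of the per-file pandas loop above is sketched below.
# It assumes the same 5000 files in the current directory, reads each one into its own Table,
# and concatenates at the end, bypassing ParquetDataset entirely.
import pyarrow as pa
import pyarrow.parquet as pq
tables = []
for i in range(5000):
    # read each file into a separate Table instead of passing all paths to ParquetDataset
    tables.append(pq.read_table(f'{i}.parquet'))
table = pa.concat_tables(tables)  # one Table holding all 50,000,000 rows
df = table.to_pandas()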