Dask runs out of memory even with chunks

from dask import dataframe as dd

BLOCKSIZE = 64000000  # = 64 MB chunks

df1_file_path = './mRNA_TCGA_breast.csv'
df2_file_path = './miRNA_TCGA_breast.csv'

# Gets Dataframes
df1 = dd.read_csv(
    df1_file_path,
    delimiter='\t',
    blocksize=BLOCKSIZE
)
first_column = df1.columns.values[0]
df1.set_index(first_column)  # note: the result is not assigned, so df1 is unchanged

df2 = dd.read_csv(
    df2_file_path,
    delimiter='\t',
    blocksize=BLOCKSIZE
)
first_column = df2.columns.values[0]
df2.set_index(first_column)  # note: the result is not assigned, so df2 is unchanged

# Filter common columns
common_columns = df1.columns.intersection(df2.columns)
df1 = df1[common_columns]
df2 = df2[common_columns]

# Computes a Cartesian product
df1['_tmpkey'] = 1
df2['_tmpkey'] = 1

# Neither of these two options works
# df1.merge(df2, on='_tmpkey').drop('_tmpkey', axis=1).to_hdf('/tmp/merge.*.hdf', key='/merge_data')
# df1.merge(df2, on='_tmpkey').drop('_tmpkey', axis=1).to_parquet('/tmp/')

1 Answer

User

#1 · Posted 2024-10-01 15:38:49

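I was able to run your code successfully with the approach below, under a memory limit of 32 GB.

I removed the BLOCKSIZE argument and used repartition on df1 and df2 instead: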

df1 = df1.repartition(npartitions=50)
df2 = df2.repartition(npartitions=1)

Note that df2 is actually much smaller than df1 (2.5 MB vs 23.75 MB), which is why I kept a single partition for df2 and split df1 into 50 partitions.

This should make the code work for you. For me, memory usage stayed below 12 GB.

As a check, I computed the len of the result:

len(df) # 3001995

Following the steps above produces a Parquet file with 50 partitions. You can use repartition again to get the partition size you want.

NB:

Adding this can speed up your code:

from dask.distributed import Client
client = Client()

In my case, because of my runtime environment, I had to use Client(processes=False).
