I want to merge two dataframes using pandas, but I get a MemoryError. It is probably a memory issue, since my files are large: df1 has about 40,000,000 rows, and df2a has 80,000,000 rows with 5 columns. However, when I merge df1 with another, similar file that has 90,000,000 rows and 5 columns (df2b), the merge works fine.
Here is my code:
# Merge the files with pandas python
import pandas as pd
# Read lookup file from GTEx
df1 = pd.read_table("GTEx.lookup_table.txt.gz", compression="gzip", sep="\t", header=0)
df1.columns = df1.columns.str.replace('rs_id_dbSNP147_GRCh37p13', 'rsid')
df2a = pd.read_table("Proximal.nominals.FULL.txt.gz", sep=" ", header=None, compression="gzip") # this file gives the Memory error
df2b = pd.read_table("Proximal.nominals2.FULL.txt.gz", sep=" ", header=None, compression="gzip") # this file merges just fine
df2a_merge = pd.merge(left=df1, right=df2a, left_on="rsid", right_on='rsid')
df2b_merge = pd.merge(left=df1, right=df2b, left_on="rsid", right_on='rsid')
I checked how much memory each file uses: df2b actually takes up more memory, yet it still merges fine. Also, the dtypes in df2a and df2b are the same:
gene_id object
rsid object
distance int64
n_pval float64
nslope float64
dtype: object
The error I get is:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/users/jfertaj/python/lib/python2.7/site-packages/pandas/core/reshape/merge.py", line 54, in merge
return op.get_result()
File "/users/jfertaj/python/lib/python2.7/site-packages/pandas/core/reshape/merge.py", line 569, in get_result
join_index, left_indexer, right_indexer = self._get_join_info()
File "/users/jfertaj/python/lib/python2.7/site-packages/pandas/core/reshape/merge.py", line 734, in _get_join_info
right_indexer) = self._get_join_indexers()
File "/users/jfertaj/python/lib/python2.7/site-packages/pandas/core/reshape/merge.py", line 713, in _get_join_indexers
how=self.how)
File "/users/jfertaj/python/lib/python2.7/site-packages/pandas/core/reshape/merge.py", line 998, in _get_join_indexers
return join_func(lkey, rkey, count, **kwargs)
File "pandas/_libs/join.pyx", line 71, in pandas._libs.join.inner_join (pandas/_libs/join.c:120300)
By the way, I want to do an inner merge.
I'd suggest using the dask package for dataframes of this size. In particular, see its DataFrame collection (dask.dataframe), which handles large pandas dataframes in partitions and parallelizes computation over them. Your code could be modified along these lines: