<p>This approach eventually worked. Here is a model of my code:</p>
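<pre><code>import csv

# Python 3 file modes; the original Python 2 version used "rU" and "wb".
idata = open("KEY_ABC.csv", newline="")
odata = open("KEY_XYZ.csv", newline="")
leftdata = csv.reader(idata)
rightdata = csv.reader(odata)

def gen_chunks(reader, chunksize=1000000):
    """Yield lists of up to chunksize rows from reader."""
    chunk = []
    for i, line in enumerate(reader):
        if (i % chunksize == 0 and i > 0):
            yield chunk
            # The same list is reused, so consume each chunk before asking for the next.
            del chunk[:]
        chunk.append(line)
    yield chunk

count = 0

# One dictionary per non-key column of the right dataset, keyed on column 3.
d1 = dict([(rows[3], rows[0]) for rows in rightdata])
odata.seek(0)  # rewind the file so the reader can scan it again
d2 = dict([(rows[3], rows[1]) for rows in rightdata])
odata.seek(0)
d3 = dict([(rows[3], rows[2]) for rows in rightdata])

for chunk in gen_chunks(leftdata):
    # Append the three right-hand columns, matching on left column 6;
    # keys with no match get "NaN".
    res = [[k[0], k[1], k[2], k[3], k[4], k[5], k[6],
            d1.get(k[6], "NaN")] for k in chunk]
    res1 = [[k[0], k[1], k[2], k[3], k[4], k[5], k[6], k[7],
             d2.get(k[6], "NaN")] for k in res]
    res2 = [[k[0], k[1], k[2], k[3], k[4], k[5], k[6], k[7], k[8],
             d3.get(k[6], "NaN")] for k in res1]
    # Write each joined chunk to its own numbered file: FINAL_1.csv, FINAL_2.csv, ...
    namestart = "FINAL_"
    nameend = ".csv"
    count = count + 1
    filename = namestart + str(count) + nameend
    with open(filename, "w", newline="") as csvfile:
        output = csv.writer(csvfile)
        output.writerows(res2)
</code></pre>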
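<p>By splitting the left dataset into chunks, converting the right dataset into one dictionary per non-key column, and adding columns to the left dataset (filled in from the dictionaries by key match), the script completed the entire left join in about 4 minutes with no memory issues.</p>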
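<p>The same idea can probably be expressed more compactly with a single dictionary that maps each key to all of its non-key values. The following is only a minimal sketch, not the script I actually ran: the helper names <code>build_lookup</code> and <code>left_join</code> and the in-memory sample data are made up for illustration, and it assumes the left rows have exactly 7 columns with the key in column 6 and the right rows have the key in column 3.</p>

```python
import csv
import io

def build_lookup(right_rows, key_col=3, value_cols=(0, 1, 2)):
    """Map each right-hand key to the list of its non-key column values.

    Hypothetical helper. If a key appears more than once, the last row wins,
    as with the per-column dictionaries.
    """
    return {row[key_col]: [row[c] for c in value_cols] for row in right_rows}

def left_join(left_rows, lookup, key_col=6, n_values=3, missing="NaN"):
    """Yield each left row extended with the matching right-hand values.

    Hypothetical helper. Rows whose key is absent from lookup are padded
    with the missing marker.
    """
    default = [missing] * n_values
    for row in left_rows:
        yield list(row) + lookup.get(row[key_col], default)

# Tiny in-memory stand-ins for the two CSV files (made-up data).
right_csv = io.StringIO("a1,b1,c1,k1\na2,b2,c2,k2\n")
left_csv = io.StringIO("0,1,2,3,4,5,k1\n0,1,2,3,4,5,k9\n")

lookup = build_lookup(csv.reader(right_csv))
joined = list(left_join(csv.reader(left_csv), lookup))
```

<p>Because <code>left_join</code> is a generator, its output could also be passed straight to <code>csv.writer(...).writerows(...)</code>, which accepts any iterable of rows, so the rows would be streamed without building whole chunks in memory. I have not timed this against the chunked version.</p>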
<p>Thanks also to user <a href="https://stackoverflow.com/users/89391/miku">miku</a>, who provided the chunk generator code in a comment on <a href="https://stackoverflow.com/questions/4956984/how-do-you-split-a-csv-file-into-evenly-sized-chunks-in-python">this post</a>.</p>
<p>That said, I highly doubt this is the most efficient way to do it. If anyone has suggestions for improving this approach, fire away.</p>