<p>I have made a few assumptions here; if any of them is wrong, the code will need to be more complex:</p>
<ul>
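<li>the index is made of the "two" fields that you said are similar between the datasets (in my example, the fields used for the index are the first two fields of each row)</li>
<li>an index value appears at most once in each dataset (no more than once in left.csv, no more than once in right.csv)</li>
<li>you want to minimize memory usage</li>
<li>the delimiter (<code>\t</code>) never appears inside a field</li>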
</ul>
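<p>The uniqueness assumption can be checked before running the join. The following is only a minimal sketch: <code>find_duplicate_keys</code> and its parameters are hypothetical names introduced for illustration, and it assumes the key is the first two tab-separated fields of each row.</p>

```python
import os
import tempfile


def find_duplicate_keys(path, delimiter=b'\t', key_columns=2):
    """Return the set of keys (first `key_columns` fields) seen more than once.

    Hypothetical helper: the name and parameters are illustrative only.
    """
    seen = set()
    duplicates = set()
    with open(path, 'rb') as fp:
        for line in fp:
            # Build the key the same way the join does: first N fields.
            fields = line.rstrip(b'\r\n').split(delimiter)[:key_columns]
            key = delimiter.join(fields)
            if key in seen:
                duplicates.add(key)
            else:
                seen.add(key)
    return duplicates


# Small demonstration on a throwaway file with one repeated key.
with tempfile.TemporaryDirectory() as tmp:
    sample = os.path.join(tmp, 'left.csv')
    with open(sample, 'wb') as fp:
        fp.write(b'a\t1\tfoo\nb\t2\tbar\na\t1\tbaz\n')
    duplicates = find_duplicate_keys(sample)
```

<p>An empty result for both files means the "at most once" assumption holds. Note that this check keeps every key in memory, just as the index itself does.</p>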
<hr/>
<pre class="lang-python prettyprint-override"><code>import mmap

# key (first two columns) -> (offset, length) of that line in left.csv
indexes = {}
# binary mode: mmap works on bytes, so keys and lines are bytes throughout
left_fp = open('left.csv', 'rb')
left = mmap.mmap(left_fp.fileno(), 0, access=mmap.ACCESS_READ)
while True:
    start = left.tell()
    line = left.readline()
    if not line:
        break
    # extract only the two columns you check
    cells = line.rstrip(b'\r\n').split(b'\t')[0:2]
    # store line position in left file
    indexes[b'\t'.join(cells)] = (start, left.tell() - start)

with open('output.csv', 'wb') as output, open('right.csv', 'rb') as right:
    for line in right:
        # recreate the key
        cells = line.rstrip(b'\r\n').split(b'\t')[0:2]
        key = b'\t'.join(cells)
        # .get() returns None when the key is missing from left.csv
        pos = indexes.get(key)
        if pos:
            # got the left line position
            left.seek(pos[0], 0)
            # write it
            output.write(left.read(pos[1]))
            # write right row
            output.write(line)

left.close()
left_fp.close()
</code></pre>
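<p>If the "at most once" assumption does not hold for left.csv, one possible adaptation is to store a list of positions per key instead of a single one. This is a sketch under that assumption, not a tested replacement for the code above: <code>make_key</code> and <code>join_allowing_duplicates</code> are hypothetical names, and every matching left line is written followed by the right row.</p>

```python
import mmap
import os
import tempfile
from collections import defaultdict


def make_key(line, delimiter=b'\t', key_columns=2):
    """Key = first `key_columns` fields of a row (hypothetical helper)."""
    return delimiter.join(line.rstrip(b'\r\n').split(delimiter)[:key_columns])


def join_allowing_duplicates(left_path, right_path, output_path):
    """Write every (left line, right line) pair that shares a key."""
    positions = defaultdict(list)  # key -> [(offset, length), ...]
    with open(left_path, 'rb') as left_fp:
        left = mmap.mmap(left_fp.fileno(), 0, access=mmap.ACCESS_READ)
        try:
            # Pass 1: remember where each left line starts and how long it is.
            while True:
                start = left.tell()
                line = left.readline()
                if not line:
                    break
                positions[make_key(line)].append((start, left.tell() - start))
            # Pass 2: stream the right file and look up matching left lines.
            with open(right_path, 'rb') as right, \
                    open(output_path, 'wb') as output:
                for line in right:
                    for offset, length in positions.get(make_key(line), ()):
                        left.seek(offset)
                        output.write(left.read(length))
                        output.write(line)
        finally:
            left.close()


# Small demonstration with a key that occurs twice in the left file.
with tempfile.TemporaryDirectory() as tmp:
    left_path = os.path.join(tmp, 'left.csv')
    right_path = os.path.join(tmp, 'right.csv')
    out_path = os.path.join(tmp, 'output.csv')
    with open(left_path, 'wb') as fp:
        fp.write(b'a\t1\tL1\na\t1\tL2\nb\t2\tL3\n')
    with open(right_path, 'wb') as fp:
        fp.write(b'a\t1\tR1\nc\t3\tR2\n')
    join_allowing_duplicates(left_path, right_path, out_path)
    with open(out_path, 'rb') as fp:
        result = fp.read()
```

<p>Only offsets and lengths are kept in memory, so the memory cost still grows with the number of left rows rather than with their size.</p>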