<p>The following approach should help. It is designed to be faster and to reduce memory requirements:</p>
<pre><code>import csv
from heapq import merge
from itertools import groupby, ifilter

def get_click_entries(key):
    # Re-read the clicks file, yielding only the rows for this key,
    # padded with an empty column so they match the buys row layout
    with open('clicks.csv', 'rb') as f_clicks:
        for entry in ifilter(lambda x: int(x[0]) == key, csv.reader(f_clicks)):
            entry.insert(4, '')     # add empty missing column
            yield entry

# First create a set holding all column 0 click entries
with open('clicks.csv', 'rb') as f_clicks:
    click_keys = {int(cols[0]) for cols in csv.reader(f_clicks)}

with open('buys.csv', 'rb') as f_buys, \
     open('merged.csv', 'wb') as f_merged:

    csv_buys = csv.reader(f_buys)
    csv_merged = csv.writer(f_merged)

    for k, g in groupby(csv_buys, key=lambda x: int(x[0])):
        if k in click_keys:
            buys = sorted(g, key=lambda x: (x[1], x[2]))
            clicks = sorted(get_click_entries(k), key=lambda x: (x[1], x[2]))
            # Merge the two sorted lists based on the timestamp columns
            csv_merged.writerows(merge(buys, clicks))
            click_keys.remove(k)
        else:
            csv_merged.writerows(g)

    # Write any remaining click entries
    for k in click_keys:
        csv_merged.writerows(get_click_entries(k))
</code></pre>
<p>For the two example files, this produces the following output:</p>
^{pr2}$
<p>It works by first building a set of every entry in column 0 of the clicks file; this lets you avoid re-reading the entire clicks file when you already know a key is not present. It then reads a group of rows sharing the same column 0 value from <code>buys</code>, and fetches the corresponding rows for that key from <code>clicks</code>. Both lists are sorted by timestamp and merged together in order. The key is then removed from the set so those click entries are not written again at the end.</p>
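<p>To see how <code>heapq.merge</code> interleaves the two pre-sorted row lists, here is a minimal, self-contained sketch. The sample rows are hypothetical (a session id, date, time, and item id column), chosen only to illustrate the ordering; they are not taken from the question's data:</p>
<pre><code>from heapq import merge

# Hypothetical rows: [session_id, date, time, item_id]
buys = [
    ['420374', '2014-04-06', '18:44:58', '214537888'],
    ['420374', '2014-04-06', '18:44:58', '214537850'],
]
clicks = [
    ['420374', '2014-04-06', '18:44:10', '214537888'],
    ['420374', '2014-04-06', '18:45:01', '214537850'],
]

# Rows (lists) compare lexicographically, so sorting and merging
# orders them by (session_id, date, time)
merged = list(merge(sorted(buys), sorted(clicks)))
for row in merged:
    print(row)
</code></pre>
<p>Because <code>merge</code> only ever peeks at the head of each input, it can combine the two lists without re-sorting the whole result.</p>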