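<p>Assuming you have enough memory, you are better off sorting the file in memory, for example by putting it into a dictionary keyed by CIK, and then writing it to disk in one go. If I/O really is your bottleneck, you should gain a lot from opening each output file only once.</p>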
<pre><code>import csv
from collections import defaultdict
from os.path import join

file_path = ".../master.tsv"

# Group every line by its zero-padded CIK (the first "|"-separated field).
data = defaultdict(list)
with open(file_path, 'r') as masterfile:
    for line in masterfile:
        cik = line.split("|", 1)[0].zfill(10)
        data[cik].append(line)

# Write each group once, so every output file is opened a single time.
for cik, lines in data.items():
    save_path = join(".../data-sorted", cik + ".csv")
    with open(save_path, 'w', newline='') as savefile:
        wr = csv.writer(savefile, quoting=csv.QUOTE_ALL)
        for line in lines:
            # Drop the trailing newline so it does not end up in the last field.
            wr.writerow(line.rstrip("\n").split("|"))
</code></pre>
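<p>To see what the grouping step produces, here is a minimal sketch on a few made-up rows. The helper name <code>group_lines_by_cik</code> and the sample data are assumptions for illustration only; they are not part of the original file or code.</p>

```python
from collections import defaultdict


def group_lines_by_cik(lines):
    """Group raw lines by their zero-padded CIK (first "|"-separated field)."""
    groups = defaultdict(list)
    for line in lines:
        cik = line.split("|", 1)[0].zfill(10)
        groups[cik].append(line)
    return groups


# Made-up sample rows, for illustration only.
sample = [
    "1000045|FIRM A|10-Q\n",
    "1000180|FIRM B|10-K\n",
    "1000045|FIRM A|8-K\n",
]
groups = group_lines_by_cik(sample)
for cik in sorted(groups):
    print(cik, len(groups[cik]))
```

<p>Each key is the CIK padded to ten digits, and each value keeps the lines in the order they appeared in the input.</p>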
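<p>You may not have enough memory to load the entire file. In that case you can dump it in chunks; if the chunks are large enough, this should still save you a lot of I/O. The chunking approach below is very quick and dirty.</p>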
<pre><code>import csv
from collections import defaultdict
from itertools import groupby
from os.path import join

chunk_size = 10000  # units of lines
file_path = ".../master.tsv"

with open(file_path, 'r') as masterfile:
    # enumerate() numbers the lines; integer division groups them into chunks.
    for _, chunk in groupby(enumerate(masterfile),
                            key=lambda item: item[0] // chunk_size):
        data = defaultdict(list)
        for _index, line in chunk:  # each item is an (index, line) pair
            cik = line.split("|", 1)[0].zfill(10)
            data[cik].append(line)
        # Append, because a CIK can show up again in a later chunk.
        for cik, lines in data.items():
            save_path = join(".../data-sorted", cik + ".csv")
            with open(save_path, 'a', newline='') as savefile:
                wr = csv.writer(savefile, quoting=csv.QUOTE_ALL)
                for line in lines:
                    # Drop the trailing newline so it does not end up in the last field.
                    wr.writerow(line.rstrip("\n").split("|"))
</code></pre>
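<p>The chunking step can also be written without <code>groupby</code> and <code>enumerate</code>. The following is a minimal sketch, assuming a hypothetical helper <code>iter_chunks</code> (not part of the original code) built on <code>itertools.islice</code>; it only shows how the lines are split into chunks, not the writing step.</p>

```python
from itertools import islice


def iter_chunks(iterable, size):
    """Yield successive lists of at most `size` items from `iterable`."""
    iterator = iter(iterable)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


# Made-up lines standing in for an open file handle.
for chunk in iter_chunks(["a\n", "b\n", "c\n", "d\n", "e\n"], 2):
    print(len(chunk))
```

<p>An open file object can be passed in place of the list, since iterating over a file yields its lines. Each chunk is then a plain list of lines, so the inner loop needs no index unpacking.</p>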