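<p>I have a large file with several million lines of text. I want to extract a smaller random subset (250,000 lines) from it. I wrote the code below, but it is surprisingly slow, in fact unusably slow. What can I do to speed it up?</p>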
<pre><code>import numpy as np


def get_shorter_subset(fname, new_len):
    """Extract a random shorter subset of length new_len from a given file"""
    out_lines = []
    with open(fname + "short.out", 'w') as out_file:
        with open(fname, 'r') as in_file:
            # Read the whole file into memory as a list of lines
            all_lines = in_file.readlines()
            total = len(all_lines)
            print("Total lines:", total)
            for i in range(new_len):
                # Pick one random line, then remove it so it is not picked again
                line = np.random.choice(all_lines)
                out_lines.append(line.rstrip('\t\r\n'))
                #out_file.write(line.rstrip('\t\r\n'))
                print("Done with", i, "lines")
                all_lines.remove(line)
        out_file.write("\n".join(out_lines))
</code></pre>
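<p>For comparison, here is a minimal sketch of one possible way to avoid the per-iteration cost. In the loop above, each call to <code>np.random.choice(all_lines)</code> has to convert the full list to an array, and each <code>all_lines.remove(line)</code> scans the list, so every one of the 250,000 iterations touches millions of entries. The sketch swaps in the standard library's <code>random.sample</code>, which draws <code>new_len</code> distinct items in a single call. It assumes the file fits in memory (as the <code>readlines()</code> call already does); the function name <code>sample_lines</code> and the <code>out_path</code> and <code>seed</code> parameters are hypothetical, introduced only for illustration.</p>

```python
import random


def sample_lines(in_path, out_path, new_len, seed=None):
    """Write new_len distinct, randomly chosen lines of in_path to out_path.

    Hypothetical helper: assumes the whole file fits in memory.
    Returns the total number of lines read from in_path.
    """
    rng = random.Random(seed)  # seed is optional, only for reproducible runs

    with open(in_path, "r") as in_file:
        all_lines = in_file.readlines()

    # random.sample selects new_len distinct positions without replacement;
    # it raises ValueError if new_len is larger than len(all_lines).
    chosen = rng.sample(all_lines, new_len)

    with open(out_path, "w") as out_file:
        # Same stripping and joining as the original function
        out_file.write("\n".join(line.rstrip("\t\r\n") for line in chosen))

    return len(all_lines)
```

<p>A call might look like <code>sample_lines("big.txt", "big.txtshort.out", 250000)</code>, where both file names are placeholders. The sampled lines come out in random order, not in their original file order.</p>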