<p>You say you have "about 500GB of text files". If I understand correctly, your observations do not have a fixed length (note, I don't mean the <em>number of lines</em>; I mean the total length of all of an observation's lines, in bytes). That means you have to traverse the whole file, because you can't know exactly where the newlines will fall.</p>
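<p>To make that point concrete: with variable-length lines, the only general way to reach line <em>N</em> is to stream the file and count newlines as you go. A minimal sketch of that idea (the function name <code>read_line</code> is just illustrative, not part of any library):</p>

<pre><code>def read_line(filename, lineno):
    """Return line `lineno` (1-based) by streaming, or '' if absent."""
    with open(filename) as f:
        # File objects iterate lazily, one line at a time, so memory
        # use stays constant even for very large files.
        for i, line in enumerate(f, start=1):
            if i == lineno:
                return line
    return ''
</code></pre>

<p>The upside is constant memory; the downside is that every lookup starts from the top of the file, which is exactly the cost <code>linecache</code> avoids by caching.</p>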
<p>Now, depending on the size of each text file, you may want a different answer. But if each file is small enough (under 1GB?), you can probably use the <a href="http://docs.python.org/library/linecache.html" rel="noreferrer"><code>linecache</code></a> module, which handles line-based lookup for you.</p>
<p>You might use it something like this:</p>
<pre><code>import linecache
filename = 'observations1.txt'
# Start at 44th line
curline = 44
lines = []
# Keep looping until no return string is found
# getline() never throws errors, but returns an empty string ''
# if the line wasn't found (if the line was actually empty, it would have
# returned the newline character '\n')
while linecache.getline(filename, curline):
    for i in range(75):
        lines.append(linecache.getline(filename, curline).rstrip())
        curline += 1
    # Perform work with the set of observation lines
    add_to_observation_log(lines)
    # Skip the unnecessary section and reset the lines list
    curline += 4
    lines = []
</code></pre>
<p>I tried this out, and it chewed through a 23MB test file in under five seconds.</p>
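<p>One caveat if you scale this up to many files: <code>linecache</code> keeps the full contents of every file it has read cached in memory, so after finishing each file you'll want to call <code>linecache.clearcache()</code> to release it. A small self-contained sketch (the temp file here is only for demonstration):</p>

<pre><code>import linecache, os, tempfile

path = tempfile.mkstemp(suffix='.txt')[1]
with open(path, 'w') as f:
    f.write('first\nsecond\n')

# The first getline() call reads and caches the whole file.
print(linecache.getline(path, 2).rstrip())

# Release the cached file contents before moving on to the next file.
linecache.clearcache()
os.remove(path)
</code></pre>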