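<p>Because the data is sequential, reading from the end of the file (to find the matching end point) is still a poor solution if the start and end of the region of interest are both near the beginning of the file!</p>
<p>I've written some code that quickly finds the start and end points you need. The approach is called <a href="http://en.wikipedia.org/wiki/Binary_search_algorithm" rel="nofollow">binary search</a>, and it works like the classic children's "higher or lower" guessing game!</p>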
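<p>The script reads a test line midway between <code>lower_bound</code> and <code>upper_bound</code> (initially the start and end of the file) and checks it against the match criterion. If the sought line is earlier, it guesses again by reading a line halfway between <code>lower_bound</code> and the previous read attempt; if it is later, it splits the range between the previous guess and <code>upper_bound</code>. You keep iterating like this, halving the distance between the bounds each time, which gives the fastest solution on average.</p>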
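<p>As a rough in-memory illustration of the same "higher or lower" loop, the sketch below searches a sorted list of timestamps instead of a file. The name <code>find_index</code> and the sample data are made up for this example; the file-based code in this answer works on byte offsets rather than list indices.</p>

```python
import datetime

def find_index(timestamps, target):
    """Return the index of target in the sorted list timestamps, or None."""
    lower, upper = 0, len(timestamps)      # search window is [lower, upper)
    while lower < upper:
        trial = (lower + upper) // 2       # guess the middle of the window
        if timestamps[trial] == target:
            return trial
        elif timestamps[trial] < target:
            lower = trial + 1              # target is later: drop the lower half
        else:
            upper = trial                  # target is earlier: drop the upper half
    return None                            # window is empty: no match

# hypothetical sample data: one entry per minute, starting at 13:00
base = datetime.datetime(2012, 2, 1, 13, 0, 0)
timestamps = [base + datetime.timedelta(minutes=m) for m in range(60)]
print(find_index(timestamps, datetime.datetime(2012, 2, 1, 13, 10, 0)))
```

<p>Each pass through the loop halves the window, which is where the logarithmic number of reads comes from.</p>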
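<p>This should be a really quick solution (the number of reads grows as the base-2 logarithm of the number of lines!). For example, in the worst case (finding line 999 out of 1000 lines), a binary search needs only about 10 line reads! (A billion lines would take only about 30...)</p>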
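<p>To check the arithmetic, here is a small sketch of the worst-case probe count for a textbook binary search over <code>n</code> sorted items, which is floor(log2(n)) + 1. The helper name <code>worst_case_probes</code> is hypothetical, and the file-based version may make a few extra reads because it has to skip partial lines.</p>

```python
def worst_case_probes(line_count):
    """Most probes a textbook binary search needs: floor(log2(n)) + 1."""
    # int.bit_length() is exactly floor(log2(n)) + 1 for n >= 1 (and 0 for n == 0)
    return line_count.bit_length()

for n in (1000, 10**9):
    print(n, "lines ->", worst_case_probes(n), "probes at most")
```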
<p>Assumptions for the code below:</p>
<ul>
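<li>Every line starts with time information.</li>
<li>The times are unique. If they are not, then once a match is found you will have to check backwards or forwards to include or exclude all entries with the matching time, as appropriate.</li>
<li>Amusingly, this is a recursive function, so with Python's default recursion limit of 1000 the number of lines in your file is limited to about 2**1000 (luckily, this allows for quite a large file...)</li>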
</ul>
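<p>If the times are not unique, one way to handle the second assumption is to walk forward from the matched offset while the timestamp stays the same. The sketch below is only an illustration: <code>parse_time</code>, <code>last_offset_with_time</code> and the sample log are hypothetical names and data, and it assumes lines of the form <code>[YYYY-mm-dd HH:MM:SS] ...</code> read in binary mode.</p>

```python
import datetime
import io

LFMT = '%Y-%m-%d %H:%M:%S'

def parse_time(line):
    """Parse b'[YYYY-mm-dd HH:MM:SS] ...' into a datetime, or return None."""
    if line.startswith(b'['):
        return datetime.datetime.strptime(line[1:20].decode('ascii'), LFMT)
    return None

def last_offset_with_time(log, offset, target):
    """Starting at a known match, return the offset of the last line with that time."""
    log.seek(offset)
    last = offset
    while True:
        position = log.tell()              # start of the line about to be read
        line = log.readline()
        if not line or parse_time(line) != target:
            return last                    # end of file or a different timestamp
        last = position

# hypothetical sample log; every line is 24 bytes long
log = io.BytesIO(
    b"[2012-02-01 13:09:00] a\n"
    b"[2012-02-01 13:10:00] b\n"
    b"[2012-02-01 13:10:00] c\n"
    b"[2012-02-01 13:11:00] d\n"
)
target = datetime.datetime(2012, 2, 1, 13, 10, 0)
print(last_offset_with_time(log, 24, target))   # 24 is the offset of line "b"
```

<p>Walking backwards to the first duplicate is more awkward with plain <code>seek()</code>, so a forward scan from an earlier known offset may be simpler in practice.</p>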
<p>Further:</p>
<ul>
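<li>If you prefer, this could be adapted to read in arbitrary blocks rather than line by line, as J.F. Sebastian suggested.</li>
<li>In my original answer I suggested this approach but using <a href="http://docs.python.org/library/linecache.html" rel="nofollow">linecache.getline</a>. While that is possible, it is unsuitable for large files because it reads the whole file into memory (so <code>file.seek()</code> is better). Thanks to TerryE and J.F. Sebastian for pointing that out.</li>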
</ul>
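<p>As a sketch of the block-based variant mentioned above, the probe step could read one fixed-size chunk and pick the first complete line out of it. The function name <code>first_full_line_in_chunk</code>, the <code>chunk_size</code> default and the sample data are assumptions made for illustration, not part of the code below.</p>

```python
import io

def first_full_line_in_chunk(log, position, chunk_size=4096):
    """Read one chunk at position; return (offset, line) of its first complete line."""
    log.seek(position)
    chunk = log.read(chunk_size)
    if position == 0:
        start = 0                          # the start of the file is a line start
    else:
        newline = chunk.find(b'\n')        # skip the (probably partial) first line
        if newline == -1:
            return None                    # no line boundary inside this chunk
        start = newline + 1
    end = chunk.find(b'\n', start)
    if end == -1:
        return None                        # line is cut off: retry with a bigger chunk
    return position + start, chunk[start:end]

# hypothetical sample log; every line is 24 bytes long
log = io.BytesIO(
    b"[2012-02-01 13:09:00] a\n"
    b"[2012-02-01 13:10:00] b\n"
    b"[2012-02-01 13:10:00] c\n"
    b"[2012-02-01 13:11:00] d\n"
)
print(first_full_line_in_chunk(log, 30))
```

<p>Returning <code>None</code> when no complete line fits leaves it to the caller to decide whether to enlarge the chunk or treat the range as exhausted.</p>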
<pre><code>import datetime

lfmt = '%Y-%m-%d %H:%M:%S'

def match(line):
    # return the timestamp of a line such as b'[2012-02-01 13:10:00] ...', else None
    if line.startswith(b'['):
        return datetime.datetime.strptime(line[1:20].decode('ascii'), lfmt)
    return None

def retrieve_test_line(position):
    file.seek(position, 0)
    file.readline()  # avoids reading partial line, which will mess up match attempt
    new_position = file.tell()  # gets start of line position
    return file.readline(), new_position

def check_lower_bound(position):
    file.seek(position, 0)
    new_position = file.tell()  # gets start of line position
    return file.readline(), new_position

def find_line(target, lower_bound, upper_bound):
    trial = (lower_bound + upper_bound) // 2
    inspection_text, position = retrieve_test_line(trial)
    if position >= upper_bound:
        # trial fell inside the last line of the range, so check the few
        # remaining lines one by one, starting at lower_bound
        position = lower_bound
        while position < upper_bound:
            text, position = check_lower_bound(position)
            if match(text) == target:
                return position
            position = file.tell()  # start of the next line
        return None  # no match for target within range
    matched_time = match(inspection_text)
    if matched_time is None:
        return None  # line has no timestamp, so it cannot be compared
    if matched_time == target:
        return position
    elif matched_time < target:
        return find_line(target, position, upper_bound)
    else:
        return find_line(target, lower_bound, position)

# start_target: timestamp of the first line you are trying to find
start_target = datetime.datetime.strptime("2012-02-01 13:10:00", lfmt)
# end_target: timestamp of the last line you are trying to find
end_target = datetime.datetime.strptime("2012-02-01 13:39:00", lfmt)

# binary mode, so that seek() and tell() work with plain byte offsets
file = open("log_file.txt", "rb")
lower_bound = 0
file.seek(0, 2)  # find upper bound
upper_bound = file.tell()

sequence_start = find_line(start_target, lower_bound, upper_bound)
sequence_end = None
if sequence_start is not None:  # offset 0 is a valid match - corner case
    sequence_end = find_line(end_target, sequence_start, upper_bound)
    if sequence_end is None:
        print("start_target match: ", sequence_start)
        print("end match is not present in the current file")
else:
    print("start match is not present in the current file")

if sequence_start is not None and sequence_end is not None:
    print("start_target match: ", sequence_start)
    print("end_target match: ", sequence_end)
    print()
    print(start_target, 'target')
    file.seek(sequence_start, 0)
    print(file.readline().decode())
    print(end_target, 'target')
    file.seek(sequence_end, 0)
    print(file.readline().decode())
file.close()
</code></pre>