Answers to the question "Trimming a large log file"


1 answer

Anonymous, 1 day ago


Because the data is sequential, if the start and end of the region of interest are both near the beginning of the file, then reading from the end of the file (to find the matching end point) is still a poor solution.

I have written some code that quickly finds the start and end points you need. The approach is called <a href="http://en.wikipedia.org/wiki/Binary_search_algorithm" rel="nofollow">binary search</a>, and it resembles the classic children's "higher or lower" guessing game.

The script reads a trial line midway between <code>lower_bound</code> and <code>upper_bound</code> (initially the start and end of the file) and checks the match criterion. If the sought line is earlier, it guesses again by reading a line halfway between <code>lower_bound</code> and the previous trial; if the sought line is later, it splits between the trial and the upper bound. Iterating between the bounds this way halves the search range on every step, which yields a very fast solution on average.

This should be a really quick solution: the number of trial reads grows with the base-2 logarithm of the number of lines. For example, even in the worst case (finding line 999 out of 1000), binary search needs only about 10 trial reads, and a billion lines take only about 30.

Assumptions for the code below:
<ul>
<li>Every line starts with time information.</li>
<li>The times are unique. If they are not, then once a match is found you will have to check backwards or forwards to include or exclude all entries with the matching time, as required.</li>
<li>Amusingly, this is a recursive function, so with Python's default recursion limit of 1000 the number of lines in your file is limited to roughly 2**1000 (luckily, this allows for quite a large file).</li>
</ul>

Further:
<ul>
<li>This could be adapted to read in arbitrary blocks rather than by line, if preferred, as suggested by J.F. Sebastian.</li>
<li>In my original answer I suggested this approach using <a href="http://docs.python.org/library/linecache.html" rel="nofollow">linecache.getline</a>. That works, but it is unsuitable for large files because it reads the whole file into memory (so <code>file.seek()</code> is better). Thanks to TerryE and J.F. Sebastian for pointing that out.</li>
</ul>

<pre><code>import datetime

lfmt = '%Y-%m-%d %H:%M:%S'

def match(line):
    # Parse the timestamp of a line such as b"[2012-02-01 13:10:00] ..."
    if line[:1] == b'[':
        return datetime.datetime.strptime(line[1:20].decode('ascii'), lfmt)

def retrieve_test_line(position):
    file.seek(position, 0)
    file.readline()  # avoids reading partial line, which will mess up match attempt
    new_position = file.tell()  # gets start of line position
    return file.readline(), new_position

def check_lower_bound(position):
    file.seek(position, 0)
    new_position = file.tell()  # gets start of line position
    return file.readline(), new_position

def find_line(target, lower_bound, upper_bound):
    trial = (lower_bound + upper_bound) // 2
    inspection_text, position = retrieve_test_line(trial)
    if position == upper_bound:
        text, position = check_lower_bound(lower_bound)
        if match(text) == target:
            return position
        return  # no match for target within range
    matched_position = match(inspection_text)
    if matched_position == target:
        return position
    elif matched_position < target:
        return find_line(target, position, upper_bound)
    elif matched_position > target:
        return find_line(target, lower_bound, position)
    else:
        return  # no match for target within range

# first line you are trying to find:
start_target = datetime.datetime.strptime("2012-02-01 13:10:00", lfmt)
# last line you are trying to find:
end_target = datetime.datetime.strptime("2012-02-01 13:39:00", lfmt)

# binary mode, so that seek() and tell() work with exact byte offsets
file = open("log_file.txt", "rb")
lower_bound = 0
file.seek(0, 2)  # find upper bound
upper_bound = file.tell()

sequence_start = find_line(start_target, lower_bound, upper_bound)
sequence_end = None

if sequence_start is not None:  # a match at offset 0 is valid - corner case
    sequence_end = find_line(end_target, sequence_start, upper_bound)
    if sequence_end is None:
        print("start_target match: ", sequence_start)
        print("end match is not present in the current file")
else:
    print("start match is not present in the current file")

if sequence_start is not None and sequence_end is not None:
    print("start_target match: ", sequence_start)
    print("end_target match: ", sequence_end)
    print()
    print(start_target, 'target')
    file.seek(sequence_start, 0)
    print(file.readline().decode())
    print(end_target, 'target')
    file.seek(sequence_end, 0)
    print(file.readline().decode())

file.close()
</code></pre>
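
The uniqueness assumption above can be relaxed by searching for the first line whose timestamp is at or after the target, instead of for an exact match. What follows is a minimal sketch of that variation, not the answer's original code. It assumes the same <code>[YYYY-MM-DD HH:MM:SS]</code> prefix on every line, a file object opened in binary mode, and timestamps in non-decreasing order. The names <code>line_time</code>, <code>next_line_start</code> and <code>first_offset_at_or_after</code> are hypothetical helpers introduced only for illustration. It is written as a loop, so the recursion limit does not come into play.

```python
import datetime
import io

LFMT = '%Y-%m-%d %H:%M:%S'


def line_time(line):
    """Parse the timestamp of a line shaped like b'[2012-02-01 13:10:00] ...'."""
    return datetime.datetime.strptime(line[1:20].decode('ascii'), LFMT)


def next_line_start(f, offset):
    """Return the offset of the first line that starts at or after `offset`."""
    if offset == 0:
        return 0
    f.seek(offset - 1)
    f.readline()  # consume through the newline that ends the previous line
    return f.tell()


def first_offset_at_or_after(f, target):
    """Return the byte offset of the first line with time >= target,
    or the file size if every line is earlier."""
    f.seek(0, 2)
    lo, hi = 0, f.tell()
    while lo < hi:
        mid = (lo + hi) // 2
        f.seek(next_line_start(f, mid))
        line = f.readline()
        if not line or line_time(line) >= target:
            hi = mid        # the answer starts at or before this line
        else:
            lo = mid + 1    # the answer starts after byte `mid`
    return next_line_start(f, lo)


if __name__ == "__main__":
    # Hypothetical in-memory log with a repeated timestamp.
    log = io.BytesIO(
        b"[2012-02-01 13:00:00] a\n"
        b"[2012-02-01 13:10:00] b\n"
        b"[2012-02-01 13:10:00] c\n"
        b"[2012-02-01 13:39:00] d\n"
    )
    start = first_offset_at_or_after(log, datetime.datetime(2012, 2, 1, 13, 10))
    stop = first_offset_at_or_after(
        log, datetime.datetime(2012, 2, 1, 13, 39) + datetime.timedelta(seconds=1))
    log.seek(start)
    print(log.read(stop - start).decode('ascii'), end='')
```

To select an inclusive time range when timestamps repeat, one option is to call the function once with the start time and once with the end time plus one second, as in the demo; the bytes between the two offsets then cover every line in the range.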
