Trimming a large log file

3 answers

User

#1 · Edited 2024-10-03 21:33:46

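Because the data is sequential, if the start and end of the region of interest are near the beginning of the file, then reading from the end of the file (to find the matching end point) is still a poor solution!

I have written some code that quickly finds the start and end points you need. The approach is called binary search, and it works like the classic children's "higher or lower" guessing game.

The script reads a trial line midway between lower_bound and upper_bound (initially the start and end of the file, SOF and EOF) and checks the match condition. If the sought line is earlier, it guesses again by reading a line halfway between lower_bound and the previous trial; if the sought line is later, it splits the range between the previous trial and upper_bound instead. You keep iterating between the lower and upper bounds, which gives the fastest possible solution on average.

This should be a really quick solution: the number of reads grows with the base-2 logarithm of the number of lines. For example, in the worst case, finding one line among 1,000 lines takes only about 10 line reads, and a billion lines takes only about 30.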
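
The read counts quoted above can be checked with a few lines of Python. This is a minimal sketch over an in-memory sorted sequence rather than a file; `count_probes` is a hypothetical helper introduced here and is not part of the script in this answer.

```python
import math

def count_probes(sorted_items, target):
    """Binary-search sorted_items for target and return how many items were inspected."""
    lo, hi = 0, len(sorted_items)
    probes = 0
    while lo < hi:
        mid = (lo + hi) // 2
        probes += 1
        if sorted_items[mid] == target:
            return probes
        if sorted_items[mid] < target:
            lo = mid + 1   # target is "higher": discard the lower half
        else:
            hi = mid       # target is "lower": discard the upper half
    return probes

# Worst case over every possible target among 1,000 sorted "lines".
worst = max(count_probes(range(1000), t) for t in range(1000))
print(worst)                         # 10
print(math.ceil(math.log2(10**9)))   # 30
```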

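Assumptions made by the following code:

Every line starts with time information, in the form [YYYY-MM-DD HH:MM:SS].
The times are unique. If they are not, then when a match is found you will have to check backward or forward to include or exclude all entries with the matching time, as required.
Amusingly, this is a recursive function, so with Python's default recursion limit of 1000 the number of lines in your file is limited to about 2**1000 (luckily, this allows for quite a large file...).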
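
If timestamps can repeat, the "check backward or forward" step amounts to finding the first and the last entry with the matching time. A minimal in-memory sketch using the standard `bisect` module follows; the `times` list is made-up sample data, and a file-based version would have to walk neighbouring lines instead.

```python
from bisect import bisect_left, bisect_right

# Hypothetical sorted timestamps, with 13:10:00 appearing three times.
times = ["13:09:58", "13:10:00", "13:10:00", "13:10:00", "13:10:02"]

target = "13:10:00"
first = bisect_left(times, target)        # index of the first matching entry
last = bisect_right(times, target) - 1    # index of the last matching entry

inclusive = times[first:last + 1]         # every entry with the matching time
```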

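Further:

If you wish, this can be adapted to read in arbitrary blocks rather than by line, as J.F. Sebastian suggested.
In my original answer I suggested this approach but using linecache.getline. That is possible, but it is not appropriate for large files because it reads the whole file into memory (so {} is better). Thanks to TerryE and J.F. Sebastian for pointing that out.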

import datetime

lfmt = '%Y-%m-%d %H:%M:%S'

def match(line):
    # Parse the timestamp of a line such as b'[2012-02-01 13:10:00] ...'.
    # Returns None when the line does not start with '['.
    if line.startswith(b'['):
        return datetime.datetime.strptime(line[1:20].decode('ascii'), lfmt)

def retrieve_test_line(position):
    file.seek(position, 0)
    file.readline()  # avoids reading partial line, which will mess up match attempt
    new_position = file.tell()  # gets start of line position
    return file.readline(), new_position

def check_lower_bound(position):
    file.seek(position, 0)
    new_position = file.tell()  # gets start of line position
    return file.readline(), new_position

def find_line(target, lower_bound, upper_bound):
    trial = (lower_bound + upper_bound) // 2
    inspection_text, position = retrieve_test_line(trial)
    if position >= upper_bound:
        # The trial landed in the last line of the range, so no further split
        # is possible: scan the remaining lines from lower_bound one by one.
        scan = lower_bound
        while scan < upper_bound:
            text, line_start = check_lower_bound(scan)
            if match(text) == target:
                return line_start
            scan = line_start + len(text)
        return  # no match for target within range
    matched_position = match(inspection_text)
    if matched_position is None:
        return  # line has no leading timestamp, so it cannot be compared
    if matched_position == target:
        return position
    elif matched_position < target:
        return find_line(target, position, upper_bound)
    else:
        return find_line(target, lower_bound, position)

# start_target: timestamp of the first line you are trying to find
start_target = datetime.datetime.strptime("2012-02-01 13:10:00", lfmt)
# end_target: timestamp of the last line you are trying to find
end_target = datetime.datetime.strptime("2012-02-01 13:39:00", lfmt)
file = open("log_file.txt", "rb")  # binary mode, so seek()/tell() are byte offsets
lower_bound = 0
file.seek(0, 2)  # find upper bound
upper_bound = file.tell()

sequence_start = find_line(start_target, lower_bound, upper_bound)
sequence_end = None

if sequence_start is not None:  # offset 0 is a valid match (corner case)
    sequence_end = find_line(end_target, sequence_start, upper_bound)
    if sequence_end is None:
        print("start_target match: ", sequence_start)
        print("end match is not present in the current file")
else:
    print("start match is not present in the current file")

if sequence_start is not None and sequence_end is not None:
    print("start_target match: ", sequence_start)
    print("end_target match: ", sequence_end)
    print()
    print(start_target, 'target')
    file.seek(sequence_start, 0)
    print(file.readline().decode())
    print(end_target, 'target')
    file.seek(sequence_end, 0)
    print(file.readline().decode())
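
For comparison, here is a minimal iterative sketch of the same idea. It avoids recursion and returns the first line at or after a target time, so it also works when the exact timestamp is absent. It is not the code from this answer: `first_offset_at_or_after`, `_next_line`, and the sample data are hypothetical names introduced here, and it assumes a binary-mode file whose lines all start with `[YYYY-MM-DD HH:MM:SS]` in ascending order.

```python
import datetime
import io

LFMT = "%Y-%m-%d %H:%M:%S"

def _next_line(f, offset):
    """Return (start, line) for the first line that starts at or after offset."""
    if offset > 0:
        f.seek(offset - 1)
        f.readline()  # read through the newline that ends the previous line
    else:
        f.seek(0)
    start = f.tell()
    return start, f.readline()

def first_offset_at_or_after(f, target):
    """Byte offset of the first line with timestamp >= target, or the file size if none."""
    f.seek(0, io.SEEK_END)
    lo, hi = 0, f.tell()
    while lo < hi:
        mid = (lo + hi) // 2
        _, line = _next_line(f, mid)
        if line and datetime.datetime.strptime(line[1:20].decode("ascii"), LFMT) < target:
            lo = mid + 1   # first line at or after mid is too early: search to the right
        else:
            hi = mid       # that line qualifies (or we reached EOF): search to the left
    start, _ = _next_line(f, lo)
    return start

# Hypothetical sample log held in memory; a real file would be opened with open(path, "rb").
log = io.BytesIO(
    b"[2012-02-01 13:09:00] a\n"
    b"[2012-02-01 13:10:00] bb\n"
    b"[2012-02-01 13:39:00] a much longer line than the others\n"
    b"[2012-02-01 13:40:00] d\n"
)
offset = first_offset_at_or_after(log, datetime.datetime(2012, 2, 1, 13, 20, 0))
log.seek(offset)
first_line = log.readline()   # the 13:39:00 line, the first one at or after 13:20:00
```

Searching on byte offsets rather than line numbers keeps the number of probes proportional to the base-2 logarithm of the file size in bytes, whatever the line lengths are.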

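User

#2 · Edited 2024-10-03 21:33:46

7 to 10 GB is a large amount of data. If I had to analyse data like this, I would either have the application log to a database or upload the log files to a database. You can then run plenty of analysis efficiently on the database. If you are using a standard logging tool such as Log4J, logging to a database should be quite simple. This is just a suggestion for an alternative solution.

For more information on database logging, see the following post:

A good database log appender for Java?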
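
As a rough illustration of the "upload the log to a database" idea, the sketch below loads timestamped lines into SQLite with Python's standard `sqlite3` module and selects a time range through an index. The table name, column names, and sample lines are assumptions made for this example; the answer itself refers to Log4J and does not prescribe a schema.

```python
import sqlite3

# Hypothetical sample lines; a real loader would stream them from the log file.
lines = [
    "[2012-02-01 13:09:00] service started",
    "[2012-02-01 13:10:00] request received",
    "[2012-02-01 13:39:00] request finished",
    "[2012-02-01 13:40:00] service stopped",
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE log (ts TEXT NOT NULL, message TEXT NOT NULL)")
conn.executemany(
    "INSERT INTO log (ts, message) VALUES (?, ?)",
    ((line[1:20], line[22:]) for line in lines),   # split timestamp from message
)
conn.execute("CREATE INDEX idx_log_ts ON log (ts)")  # lets range queries avoid a full scan

rows = conn.execute(
    "SELECT ts, message FROM log WHERE ts BETWEEN ? AND ? ORDER BY ts",
    ("2012-02-01 13:10:00", "2012-02-01 13:39:00"),
).fetchall()
```

Storing timestamps as `YYYY-MM-DD HH:MM:SS` text keeps string order identical to chronological order, which is what makes the `BETWEEN` comparison valid here.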

User

#3 · Edited 2024-10-03 21:33:46

5 GB log is parsed about 25 minutes

Even a sequential O(n) scan in Python can do much better (~500MB/s for ^{}), i.e., the performance should be limited only by I/O.
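
A sequential scan that is limited mainly by I/O can be as simple as counting newline bytes in large binary chunks. The sketch below illustrates that idea only; `count_lines` and the 1 MiB default chunk size are assumptions for this example, and no throughput figure is claimed for it.

```python
import io

def count_lines(f, chunk_size=1 << 20):
    """Count newline bytes in a binary file object, reading chunk_size bytes at a time."""
    total = 0
    while True:
        chunk = f.read(chunk_size)
        if not chunk:          # empty read means end of file
            return total
        total += chunk.count(b"\n")

# Hypothetical in-memory sample; a real log would be opened with open(path, "rb").
sample = io.BytesIO(b"line 1\nline 2\nline 3\n")
n = count_lines(sample, chunk_size=4)
```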

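To perform a binary search on a file, you could adapt the fixed-record FileSearcher, using an approach similar to the ^{} implementation in Python (it is O(n) to scan for {}).

To avoid O(n) (when the date range selects only a small portion of the log), you could use an approximate search that works on large fixed-size chunks and tolerates missing some records because they lie on a chunk boundary. For example, use the unmodified {} with record_size=1MB and a custom Query class: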

class Query(object):

    def __init__(self, query):
        self.query = query # e.g., '2012-01-01'

    def __lt__(self, chunk):
        # assume line starts with a date; find the start of line
        i = chunk.find('\n')
        # assert '\n' in chunk and len(chunk) > (len(self.query) + i)
        # e.g., '2012-01-01' < '2012-03-01'
        return self.query < chunk[i+1:i+1+len(self.query)]

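To account for a date range that can span multiple chunks, you could modify FileSearcher.__getitem__ to return (filepos, chunk) and search twice (bisect_left(), bisect_right()) to find the approximate filepos_mindate and filepos_maxdate. After that, you can perform a linear search around the given file positions (for example, using the tail -n approach) to find the exact first and last log records.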
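
The two-pass search might look roughly like the sketch below. `ChunkedFile` is a hypothetical stand-in written for this example, not the FileSearcher class mentioned above, and the `Query` shown here is extended with `__gt__` because `bisect_left()` evaluates `chunk < query`, which Python resolves through the reflected `__gt__` of the query object. Chunks are assumed to be much larger than a single line, and dates are assumed to sort correctly as byte strings.

```python
import io
from bisect import bisect_left, bisect_right

class ChunkedFile:
    """Hypothetical fixed-record view of a binary file: item k is (filepos, chunk)."""

    def __init__(self, f, record_size):
        self.f, self.record_size = f, record_size
        f.seek(0, io.SEEK_END)
        self.size = f.tell()

    def __len__(self):
        return -(-self.size // self.record_size)   # ceiling division

    def __getitem__(self, k):
        if not 0 <= k < len(self):
            raise IndexError(k)
        filepos = k * self.record_size
        self.f.seek(filepos)
        return filepos, self.f.read(self.record_size)

class Query:
    """Compares a date prefix with the first complete line found in a chunk."""

    def __init__(self, query):
        self.query = query   # e.g. b"2012-01-05"

    def _date_in(self, item):
        _, chunk = item
        i = chunk.find(b"\n")   # a complete line starts right after the first newline
        date = chunk[i + 1:i + 1 + len(self.query)] if i >= 0 else b""
        # No complete date in this chunk: treat it as later than any query.
        return date if len(date) == len(self.query) else b"\xff"

    def __lt__(self, item):   # used by bisect_right: query < chunk
        return self.query < self._date_in(item)

    def __gt__(self, item):   # used by bisect_left: chunk < query
        return self.query > self._date_in(item)

# Hypothetical sample: ten 20-byte records, one per day, split into 50-byte chunks.
data = b"".join(b"2012-01-%02d message.\n" % day for day in range(1, 11))
chunks = ChunkedFile(io.BytesIO(data), record_size=50)

left = bisect_left(chunks, Query(b"2012-01-05"))
right = bisect_right(chunks, Query(b"2012-01-08"))

# The first matching record may sit in the chunk before `left`; the last one may
# spill a partial line past filepos_maxdate, hence the linear search afterwards.
filepos_mindate = chunks[max(left - 1, 0)][0]
filepos_maxdate = chunks[right][0] if right < len(chunks) else chunks.size
```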
