Reading gz files and getting the last 24 hours of lines in Python

I have three files: two .gz files and one .log file. The files are fairly large. Below is a sample of the raw data. I want to extract the entries that fall within the last 24 hours.

a.log.1.gz

2018/03/25-00:08:48.638553  508     7FF4A8F3D704     snononsonfvnosnovoosr
2018/03/25-10:08:48.985053 346K     7FE9D2D51706     ahelooa afoaona woom
2018/03/25-20:08:50.486601 1.5M     7FE9D3D41706     qojfcmqcacaeia
2018/03/25-24:08:50.980519  16K     7FE9BD1AF707     user: number is 93823004
2018/03/26-00:08:50.981908 1389     7FE9BDC2B707     user 7fb31ecfa700
2018/03/26-10:08:51.066967    0     7FE9BDC91700     Exit Status = 0x0
2018/03/26-15:08:51.066968    1     7FE9BDC91700     std:ZMD:

a.log.2.gz
2018/03/26-20:08:48.638553  508     7FF4A8F3D704     snononsonfvnosnovoosr
2018/03/26-24:08:48.985053 346K     7FE9D2D51706     ahelooa afoaona woom
2018/03/27-00:08:50.486601 1.5M     7FE9D3D41706     qojfcmqcacaeia
2018/03/27-10:08:50.980519  16K     7FE9BD1AF707     user: number is 93823004
2018/03/27-20:08:50.981908 1389     7FE9BDC2B707     user 7fb31ecfa700
2018/03/27-24:08:51.066967    0     7FE9BDC91700     Exit Status = 0x0
2018/03/28-00:08:51.066968    1     7FE9BDC91700     std:ZMD:

a.log
2018/03/28-10:08:48.638553  508     7FF4A8F3D704     snononsonfvnosnovoosr
2018/03/28-20:08:48.985053 346K     7FE9D2D51706     ahelooa afoaona woom

**Desired Result**
result.txt
2018/03/27-20:08:50.981908 1389     7FE9BDC2B707     user 7fb31ecfa700
2018/03/27-24:08:51.066967    0     7FE9BDC91700     Exit Status = 0x0
2018/03/28-00:08:51.066968    1     7FE9BDC91700     std:ZMD:
2018/03/28-10:08:48.638553  508     7FF4A8F3D704     snononsonfvnosnovoosr
2018/03/28-20:08:48.985053 346K     7FE9D2D51706     ahelooa afoaona woom

I don't know how to get the entries from the last 24 hours.

I want to run the function below on the last 24 hours of data.

[code block missing from the original post]

2 Answers

Something like this should work.

from datetime import datetime, timedelta
import glob
import gzip
from pathlib import Path
import shutil


def open_file(path):
    if Path(path).suffix == '.gz':
        return gzip.open(path, mode='rt', encoding='utf-8')
    else:
        return open(path, encoding='utf-8')


def parsed_entries(lines):
    for line in lines:
        yield line.split(' ', maxsplit=1)


def earlier():
    return (datetime.now() - timedelta(hours=24)).strftime('%Y/%m/%d-%H:%M:%S')


def get_files():
    return ['a.log'] + list(reversed(sorted(glob.glob('a.log.*'))))


output = open('output.log', 'w', encoding='utf-8')


files = get_files()


cutoff = earlier()


for i, path in enumerate(files):
    with open_file(path) as f:
        lines = parsed_entries(f)
        # Assumes that your files are not empty
        date, line = next(lines)
        if cutoff <= date:
            # Skip files that can just be appended to the output later
            continue
        for date, line in lines:
            if cutoff <= date:
                # We've reached the first entry of our file that should be
                # included; write it back out together with its timestamp
                output.write(date + ' ' + line)
                break
        # Copies from the current position to the end of the file
        shutil.copyfileobj(f, output)
        break
else:
    # In case ALL the files are within the last 24 hours
    i = len(files)

for path in reversed(files[:i]):
    with open_file(path) as f:
        # Assumes that your files have trailing newlines.
        shutil.copyfileobj(f, output)

# Cleanup, it would get closed anyway when garbage collected or process exits.
output.close()

If we create some test log files:

[code block missing from the original post]
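A minimal sketch of test files that would match the output shown below; the file names follow the question's naming scheme, and the entries assume that "now" is shortly after 2019/01/31-19:00:

import gzip

# a rotated, gzipped file whose single entry is older than 24 hours
with gzip.open('a.log.1.gz', 'wt', encoding='utf-8') as f:
    f.write('2019/01/30-13:00:00.000000 hi1\n')

# the live log file, whose entries fall inside the 24-hour window
with open('a.log', 'w', encoding='utf-8') as f:
    f.write('2019/01/31-00:00:00.000000 hi2\n')
    f.write('2019/01/31-19:00:00.000000 hi3\n')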

and then run our script, it outputs the expected result (for this point in time):

2019/01/31-00:00:00.000000 hi2
2019/01/31-19:00:00.000000 hi3

Processing log files usually means dealing with large amounts of data, so reading every line in ascending order each time is not desirable, since it wastes a lot of resources.

The fastest way to reach the goal that comes to mind (better approaches certainly exist) is a very simple random search: we search the log files in reverse order, starting with the newest one. Instead of visiting every line, you pick an arbitrary stepsize and only look at one line per stepsize. This way you can search through gigabytes of data in a very short time.

In addition, this approach does not need to keep every line of the file in memory, only a few lines and the final result.

When a.log is the current log file, we start searching here:

with open("a.log", "rb+") as fh:

Since we are only interested in the last 24 hours, we first jump to the end of the file and save the cutoff timestamp we are going to search against:

[code block missing from the original post]
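A minimal sketch of what the loop below presumably needs; the names index and timestamp are taken from that loop, and treating the cutoff as a datetime object (rather than a string) is an assumption:

import datetime

fh.seek(0, 2)      # jump to the end of the file
index = fh.tell()  # remember where the file ends
# the cutoff we compare against: exactly 24 hours before "now"
timestamp = datetime.datetime.now() - datetime.timedelta(hours=24)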

Now we can start the random search. Your lines look to be about 65 characters long on average, so we move in multiples of that.

average_line_length = 65
stepsize = 1000

while True:
    # we move a step back (absolute seek, clamped so we never seek
    # before the start of the file)
    index = max(index - average_line_length * stepsize, 0)
    fh.seek(index)

    # we try to read a "line" (multiply avg. line length by a number
    # large enough to cover even long lines; the longest lines are
    # ignored here, since that edge case would ruin our runtime and
    # at worst costs us one iteration of the loop)
    r = fh.read(average_line_length * 10)

    # our chunk now contains (on average) multiple lines, so we
    # split first
    lines = r.split(b"\n")

    # now we check for our timestring
    outside_margin = False
    for l in lines:
        # your timestamps are formatted like '2018/03/28-20:08:48.985053'
        # I ignore minutes, seconds, ... here, just for the sake of simplicity
        timestr = l.split(b":")  # this gives us b'2018/03/28-20' in timestr[0]

        # next we convert this to a datetime; fragments that do not start
        # with a timestamp (e.g. the partial line at the top of the chunk)
        # fail to parse and are simply skipped
        try:
            found_time = datetime.datetime.strptime(timestr[0].decode(),
                                                    "%Y/%m/%d-%H")
        except ValueError:
            continue

        # finally, we compare if the found time is not inside our 24-hour margin
        if found_time < timestamp:
            outside_margin = True
            break

    # stop once we have left the 24-hour window or reached the start of the file
    if outside_margin or index == 0:
        break

# rewind to the start of the last chunk we inspected, so that the read
# below does not skip any lines that are still inside the window
fh.seek(index)

With this code, inside the last 24 hours we only inspect one line every stepsize (here: 1000 lines). Once we have left the 24-hour window, we know that at most we have overshot by that much in the file.
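To get a feel for the cost, a rough back-of-the-envelope estimate (the 1 GB file size is an assumed example):

# rough cost estimate of the backwards search
file_size = 1_000_000_000              # assume a 1 GB log file
bytes_per_step = 65 * 1000             # average_line_length * stepsize
print(file_size // bytes_per_step)     # at most ~15384 seek-and-read steps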

Filtering out this "overshoot" is then very easy:

# read in the file's contents from the current position to the end
contents = fh.read()

# split into lines
lines_of_contents = contents.split(b"\n")

# helper function for removing all lines older than 24 hours
def check_line(line):
    # split to extract the date string
    tstr = line.split(b":")
    # convert this to a datetime; lines that do not start with a timestamp
    # (for example the empty string after a trailing newline) are dropped
    try:
        ftime = datetime.datetime.strptime(tstr[0].decode(), "%Y/%m/%d-%H")
    except ValueError:
        return False

    return ftime > timestamp

# remove all lines that are older than 24 hours
final_result = filter(check_line, lines_of_contents)

Since contents covers all of the file's remaining content (and lines_of_contents all of its lines, which is just contents split at the newline character \n), we can easily use filter to get the result we want.

Every line in lines_of_contents is fed to check_line, which returns True if the line's time is > timestamp, where timestamp is the datetime object describing exactly now - 1 day. This means check_line returns False for every line older than timestamp, and filter drops those lines.
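Note that final_result is a lazy filter object; to end up with something like the result.txt from the question, the surviving lines still have to be written out, roughly like this (the output file name is simply the one from the question):

# write the surviving lines to a result file, decoding the bytes first
with open("result.txt", "w", encoding="utf-8") as out:
    for line in final_result:
        out.write(line.decode() + "\n")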

Obviously this is far from optimal, but it is easy to understand and can easily be extended to also filter on minutes, seconds, and so on.

Covering multiple files is also easy: you just need glob.glob to find all candidate files, start with the newest one, and add another loop: you search through those files until the while loop fails for the first time, then break and read all the remaining content of the current file plus everything from all the files you visited before.

Roughly like this:

final_lines = list()

for file in logfiles:
    # our while-loop from above goes here
    while True:
        ...
    # if the while-loop did not break, all of the current logfile's
    # content is < 24 hours old
    with open(file, "rb+") as fh:
        final_lines.extend(fh.readlines())

This way you only store all of a log file's lines if all of them are < 24 hours old. If the loop breaks at some point, i.e. we found a log file with a line that is > 24 hours old, you extend final_lines by final_result, since that will cover only the lines within the last 24 hours.
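One way to build the newest-first logfiles list that the outer loop iterates over could be the following sketch; it assumes the naming scheme from the question, where the higher-numbered archive holds the newer entries:

import glob

# the live log first, then the rotated archives; with this question's data
# a plain reverse sort by name puts the newest archive first
logfiles = ["a.log"] + sorted(glob.glob("a.log.*.gz"), reverse=True)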
