Answers to: Read a gz file and get the lines from the last 24 hours in Python



1 answer

Anonymous, 1 day ago

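Processing log files usually involves large amounts of data, so you do not want to read the whole file front to back every time, because that wastes a lot of resources.

The fastest way to reach your goal that comes to mind (better approaches surely exist) is a very simple random search: we search the log file in reverse order, starting with the newest entries. Instead of visiting every line, we pick an arbitrary <code>stepsize</code> and only look at a few lines every <code>stepsize</code> lines. This way you can search through gigabytes of data in a very short time.

In addition, this approach does not need to hold every line of the file in memory, only a few lines and the final result.

With <code>a.log</code> as the current log file, we start the search here:

<pre><code>with open("a.log", "rb") as fh:
</code></pre>

Because we are only interested in the last 24 hours, we first jump to the end of the file and store the timestamp we are searching for (<code>now - 1day</code>) as <code>timestamp</code>; the current file position is kept in <code>index</code>.

Now we can start the random search. Your lines appear to be about 65 characters long on average, so we move back in multiples of that.

<pre><code># this runs inside the with-block above
average_line_length = 65
stepsize = 1000
found_older = False

while not found_older and index > 0:
    # move one step back, but never past the start of the file
    index = max(index - average_line_length * stepsize, 0)
    fh.seek(index)  # absolute position, counted from the start of the file
    # read a chunk large enough to hold a few complete lines; very long
    # lines are an edge case, so we accept skipping one iteration for them
    r = fh.read(average_line_length * 10)
    # the chunk contains (on average) several lines, so split it first;
    # the first and last pieces are usually partial lines
    lines = r.split(b"\n")
    for l in lines:
        # timestamps look like '2018/03/28-20:08:48.985053'; minutes,
        # seconds, ... are ignored here for the sake of simplicity
        timestr = l.split(b":")
        try:
            # b'2018/03/28-20' -> datetime (strptime needs str, not bytes)
            found_time = datetime.datetime.strptime(timestr[0].decode(), "%Y/%m/%d-%H")
        except ValueError:
            continue  # partial line, or no timestamp at the start
        # stop the whole search once a line is outside the 24-hour margin
        if found_time < timestamp:
            found_older = True
            break
</code></pre>

With this code we only inspect a few lines every <code>stepsize</code> lines (here: 1000 lines) as long as we are inside the last 24 hours. Once we leave the 24-hour window, we know that we have gone back at most one step too far in the file.

Filtering out this overshoot is very easy:

<pre><code># read the file's contents from the position where the search stopped
fh.seek(index)
contents = fh.read()
# split into lines
lines_of_contents = contents.split(b"\n")

# helper function for removing all lines older than 24 hours
def check_line(line):
    # split to extract the date string
    tstr = line.split(b":")
    try:
        # convert it to a datetime
        ftime = datetime.datetime.strptime(tstr[0].decode(), "%Y/%m/%d-%H")
    except ValueError:
        return False  # empty, partial, or malformed line
    return ftime > timestamp

# remove all lines that are older than 24 hours
final_result = list(filter(check_line, lines_of_contents))
</code></pre>

Since <code>contents</code> covers all remaining content of the file (and <code>lines_of_contents</code> holds all of its lines, which is simply <code>contents</code> split at the newline <code>\n</code>), we can easily use <code>filter</code> to get the desired result.

Each line in <code>lines_of_contents</code> is fed to <code>check_line</code>, which returns <code>True</code> if the line's time is <code>> timestamp</code>, where <code>timestamp</code> is the datetime object describing exactly <code>now - 1day</code>. This means <code>check_line</code> returns <code>False</code> for all lines older than <code>timestamp</code>, and <code>filter</code> removes those lines.

Obviously this is far from optimal, but it is easy to understand and easy to extend to filter on minutes, seconds, and so on.

Covering multiple files is also easy: you just need <code>glob.glob</code> to find all candidate files, start with the newest file, and add another loop. You search through the files until the while-loop breaks for the first time, then stop and read all remaining content from the current file plus all content from the files visited before.

Roughly like this:

<pre><code>final_lines = []
for file in logfiles:
    # our while-loop
    while True:
        ...
    # if the while-loop did not break, all of the current logfile's
    # content is less than 24 hours old
    with open(file, "rb") as fh:
        final_lines.extend(fh.readlines())
</code></pre>

This way you simply store all lines of a log file if all of its lines are less than 24 hours old. If the loop breaks at some point, i.e. we have found a log file containing a line older than 24 hours, extend <code>final_lines</code> by <code>final_result</code>, since that covers only the lines from the last 24 hours.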
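For a single file, the steps described in this answer can be combined into one function. The following is only a minimal sketch under stated assumptions, not the answerer's exact code: the names `recent_lines`, `parse_hour`, and `TS_FORMAT` and the `now` parameter are made up for illustration. It assumes an uncompressed log whose lines are in chronological order and each start with a timestamp like `2018/03/28-20:08:48.985053`. As in the answer, timestamps are compared at hour resolution only, so lines from the cutoff hour itself are dropped.

```python
import datetime
import os

TS_FORMAT = "%Y/%m/%d-%H"  # hour resolution only (assumed timestamp layout)


def parse_hour(line):
    """Return the hour-resolution datetime at the start of a line, or None."""
    try:
        return datetime.datetime.strptime(line.split(b":")[0].decode(), TS_FORMAT)
    except ValueError:  # empty, partial, or malformed line
        return None


def recent_lines(path, now=None, average_line_length=65, stepsize=1000):
    """Return the lines of `path` whose hour stamp is newer than now - 24h."""
    now = now or datetime.datetime.now()
    timestamp = now - datetime.timedelta(days=1)  # cutoff: now - 1 day
    step = average_line_length * stepsize
    with open(path, "rb") as fh:
        index = fh.seek(0, os.SEEK_END)  # jump to the end of the file
        found_older = False
        while not found_older and index > 0:
            index = max(index - step, 0)  # one step back, clamped at 0
            fh.seek(index)
            chunk = fh.read(average_line_length * 10)
            for line in chunk.split(b"\n"):
                found_time = parse_hour(line)
                if found_time is not None and found_time < timestamp:
                    found_older = True  # we stepped past the 24-hour window
                    break
        # Everything from `index` onward may be wanted; filter it exactly.
        fh.seek(index)
        candidates = fh.read().split(b"\n")
    result = []
    for line in candidates:
        line_time = parse_hour(line)
        if line_time is not None and line_time > timestamp:
            result.append(line)
    return result
```

Passing `now` explicitly, for example `recent_lines("a.log", now=datetime.datetime(2018, 3, 29, 12))`, makes the result reproducible; leaving it out uses the current time.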
