Splitting multi-line log entries with a Python regular expression

Posted 2024-09-30 18:20:02

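I need to create a regular expression in Python that can take the sample below and split out each log entry. I am using the date to identify the start of each entry, but my pattern only matches a single line, from the date to the end of the first line, and ignores the stack trace entirely. I want the complete entries because the log contains many repeats, and I would like to filter out the duplicates and reduce it to a handful of unique entries. Once the entries are identified, I also want to strip whatever makes each string unique (such as the date-time stamp) so that a comparison function can flag them as duplicates. I have tried positive lookaheads and the multiline flag, to no avail. Does anyone know how to do what I am after?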

Some of the regexes I tried:

^\d{4}-\d{2}-\d{2}.*\(.*\)$ // it matches single line date to parenthesis
^(\d{4}-\d{2}-\d{2}|\s|).*\)$ // matches single line with tabs - not much better
^\d{4}-\d{2}-\d{2}.*(?=\d{4}-\d{2}-\d{2}) // positive lookahead but barely works
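A likely reason these attempts stop at the first line is that . does not match a newline by default. The sketch below runs the third pattern with and without re.DOTALL on a made-up two-entry log (the text variable is illustrative, not the real sample).

```python
import re

# Hypothetical two-entry log, shortened for illustration.
text = (
    "2018-03-06 11:36:42:931 SEVERE: Error\n"
    "    at io.G.stub.CCalls.getUnchecked(CCalls.java:208)\n"
    "2018-03-06 11:36:46:844 SEVERE: Failed\n"
)

# The third attempt from the question, unchanged.
pattern = r"^\d{4}-\d{2}-\d{2}.*(?=\d{4}-\d{2}-\d{2})"

# Without DOTALL, "." cannot cross a newline, so the lookahead never
# reaches the date on a later line and nothing matches.
no_dotall = re.findall(pattern, text, flags=re.MULTILINE)

# With DOTALL, "." also matches "\n", so the match spans the stack-trace line.
with_dotall = re.findall(pattern, text, flags=re.MULTILINE | re.DOTALL)

print(no_dotall)
print(with_dotall)
```

Because .* is greedy, on a longer log the DOTALL version would swallow everything up to the last date in a single match; a lazy .*? or .+? avoids that.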

Sample string:

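2018-03-06 11:36:40:048 INFO:Starting.  (com.X.s.f.o.o)
2018-03-06 11:36:42:931 SEVERE: Error attempting to s: StatusRuntimeException   (com.Y.W.Z_H.ZHGC.sHToVe)
io.G.StatusRuntimeException: EXCEEDED
    at io.G.stub.CCalls.toStatusRuntimeException(CCalls.java:227)
    at io.G.stub.CCalls.getUnchecked(CCalls.java:208)
    at io.G.stub.CCalls.blockingUnaryCall(CCalls.java:141)
2018-03-06 11:36:46:159 SEVERE: Error attempt: StatusRuntimeException   (com.Y.W.Z_H.ZHGC.sHToVe)
io.G.StatusRuntimeException: EXCEEDED
    at io.G.stub.CCalls.toStatusRuntimeException(CCalls.java:227)
    at io.G.stub.CCalls.getUnchecked(CCalls.java:208)
    at io.G.stub.CCalls.blockingUnaryCall(CCalls.java:141)
2018-03-06 11:36:46:824 SEVERE: getConfigInteger(): eGSWindowsPortNumber    (com.Y.W.Y_Z_config_s.YZConfigs.getInteger)
2018-03-06 11:36:46:844 SEVERE: Failed to get (com.Y.W.Z_H.ZHGC.create)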

Expected output:

Match 1:

INFO:Starting.  (com.X.s.f.o.o)

Match 2:

SEVERE: Error attempting to s: StatusRuntimeException   (com.Y.W.Z_H.ZHGC.sHToVe)
io.G.StatusRuntimeException: EXCEEDED
    at io.G.stub.CCalls.toStatusRuntimeException(CCalls.java:227)
    at io.G.stub.CCalls.getUnchecked(CCalls.java:208)
    at io.G.stub.CCalls.blockingUnaryCall(CCalls.java:141)

Match 3:

SEVERE: Error attempt: StatusRuntimeException   (com.Y.W.Z_H.ZHGC.sHToVe)
io.G.StatusRuntimeException: EXCEEDED
    at io.G.stub.CCalls.toStatusRuntimeException(CCalls.java:227)
    at io.G.stub.CCalls.getUnchecked(CCalls.java:208)
    at io.G.stub.CCalls.blockingUnaryCall(CCalls.java:141)

Match 4:

SEVERE: getConfigInteger(): eGSWindowsPortNumber    (com.Y.W.Y_Z_config_s.YZConfigs.getInteger)

Match 5:

SEVERE: Failed to get (com.Y.W.Z_H.ZHGC.create)

2 answers

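Rather than trying to match each whole entry with a regex, you can match just the date and use it to split the string into the log entries you want: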

import re

sample="""2018-03-06 11:36:40:048 INFO:Starting.  (com.X.s.f.o.o)
2018-03-06 11:36:42:931 SEVERE: Error attempting to s: StatusRuntimeException   (com.Y.W.Z_H.ZHGC.sHToVe)
io.G.StatusRuntimeException: EXCEEDED
    at io.G.stub.CCalls.toStatusRuntimeException(CCalls.java:227)
    at io.G.stub.CCalls.getUnchecked(CCalls.java:208)
    at io.G.stub.CCalls.blockingUnaryCall(CCalls.java:141)
2018-03-06 11:36:46:159 SEVERE: Error attempt: StatusRuntimeException   (com.Y.W.Z_H.ZHGC.sHToVe)
io.G.StatusRuntimeException: EXCEEDED
    at io.G.stub.CCalls.toStatusRuntimeException(CCalls.java:227)
    at io.G.stub.CCalls.getUnchecked(CCalls.java:208)
    at io.G.stub.CCalls.blockingUnaryCall(CCalls.java:141)
2018-03-06 11:36:46:824 SEVERE: getConfigInteger(): eGSWindowsPortNumber    (com.Y.W.Y_Z_config_s.YZConfigs.getInteger)
2018-03-06 11:36:46:844 SEVERE: Failed to get (com.Y.W.Z_H.ZHGC.create)"""

def date_match(s):
    """Return True if the beginning of this string matches a date and time."""
    # Raw string so that \d and \s reach the regex engine unescaped.
    return bool(re.match(r"\d{4}-\d{2}-\d{2}\s\d{2}:\d{2}:\d{2}", s))

def yield_matches(full_log):
    """Yield each log entry: a dated line plus any continuation lines after it."""
    log = []
    for line in full_log.split("\n"):
        if date_match(line):  # this line starts with a date
            if len(log) > 0:  # if there's already a log...
                yield "\n".join(log)  # ...yield the log...
                log = []  # ...and set the log back to nothing.

        log.append(line)  # add the current line to log (list)

    # Yield the last log (there's no date at the end of the string to close it).
    yield "\n".join(log)

logs = list(yield_matches(sample))

for i, l in enumerate(logs):
    print("Match {}:\n{}\n".format(i + 1, l))

yield_matches appends each line to a list named log until it finds another date. When it finds a date, it yields the current log and sets log back to empty.

The output is:

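Match 1:
2018-03-06 11:36:40:048 INFO:Starting.  (com.X.s.f.o.o)

Match 2:
2018-03-06 11:36:42:931 SEVERE: Error attempting to s: StatusRuntimeException   (com.Y.W.Z_H.ZHGC.sHToVe)
io.G.StatusRuntimeException: EXCEEDED
    at io.G.stub.CCalls.toStatusRuntimeException(CCalls.java:227)
    at io.G.stub.CCalls.getUnchecked(CCalls.java:208)
    at io.G.stub.CCalls.blockingUnaryCall(CCalls.java:141)

Match 3:
2018-03-06 11:36:46:159 SEVERE: Error attempt: StatusRuntimeException   (com.Y.W.Z_H.ZHGC.sHToVe)
io.G.StatusRuntimeException: EXCEEDED
    at io.G.stub.CCalls.toStatusRuntimeException(CCalls.java:227)
    at io.G.stub.CCalls.getUnchecked(CCalls.java:208)
    at io.G.stub.CCalls.blockingUnaryCall(CCalls.java:141)

Match 4:
2018-03-06 11:36:46:824 SEVERE: getConfigInteger(): eGSWindowsPortNumber    (com.Y.W.Y_Z_config_s.YZConfigs.getInteger)

Match 5:
2018-03-06 11:36:46:844 SEVERE: Failed to get (com.Y.W.Z_H.ZHGC.create)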
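The same line-by-line idea can also be expressed as a single split on a zero-width pattern. This is a sketch of an alternative, not part of the answer above: the names ENTRY_START and split_entries are made up here, the sample is shortened, and splitting on an empty match assumes Python 3.7 or newer.

```python
import re

# Shortened, made-up sample in the same shape as the log above.
sample = (
    "2018-03-06 11:36:40:048 INFO:Starting.\n"
    "2018-03-06 11:36:42:931 SEVERE: Error\n"
    "io.G.StatusRuntimeException: EXCEEDED\n"
    "    at io.G.stub.CCalls.getUnchecked(CCalls.java:208)\n"
    "2018-03-06 11:36:46:844 SEVERE: Failed"
)

# Zero-width split point: the start of any line that begins with a timestamp.
ENTRY_START = re.compile(
    r"^(?=\d{4}-\d{2}-\d{2}\s\d{2}:\d{2}:\d{2})", re.MULTILINE
)

def split_entries(full_log):
    """Return the log entries as a list of strings, one per timestamped entry."""
    parts = ENTRY_START.split(full_log)
    # The piece before the first timestamp is empty here; drop blank pieces.
    return [p.rstrip("\n") for p in parts if p.strip()]

entries = split_entries(sample)
for number, entry in enumerate(entries, start=1):
    print("Match {}:\n{}\n".format(number, entry))
```

The generator version above may still be preferable for very large logs, since it can be adapted to read a file line by line instead of holding the whole text in memory.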

I figured this out after reading the following:

python: multiline regular expression

https://www.safaribooksonline.com/library/view/python-cookbook-3rd/9781449357337/ch02s08.html

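The regex below starts at a date, ^\d{4}-\d{2}-\d{2}, then consumes as little as possible with .+? until the lookahead (?=\d{4}-\d{2}-\d{2}) first finds another date entry, and returns everything up to that point as the match. This matches multi-line strings! :D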

^\d{4}-\d{2}-\d{2}.+?(?=\d{4}-\d{2}-\d{2})
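For this pattern to span lines, it presumably has to be compiled with re.MULTILINE (so ^ matches at each line start) and re.DOTALL (so . matches newlines). Because the lookahead requires a following date, the last entry is left out as written. The sketch below shows both behaviours on a shortened, made-up sample; the ^ inside the lookahead and the |\Z alternative are additions for illustration, not part of the answer.

```python
import re

# Shortened, made-up sample in the same shape as the log above.
sample = (
    "2018-03-06 11:36:40:048 INFO:Starting.\n"
    "2018-03-06 11:36:42:931 SEVERE: Error\n"
    "io.G.StatusRuntimeException: EXCEEDED\n"
    "    at io.G.stub.CCalls.getUnchecked(CCalls.java:208)\n"
    "2018-03-06 11:36:46:844 SEVERE: Failed"
)

FLAGS = re.MULTILINE | re.DOTALL

# The pattern as given: the final entry has no date after it, so it is missed.
as_written = re.findall(
    r"^\d{4}-\d{2}-\d{2}.+?(?=\d{4}-\d{2}-\d{2})", sample, flags=FLAGS
)

# Variant: stop at the next line that starts with a date, or at end of string.
ENTRY = re.compile(r"^\d{4}-\d{2}-\d{2}.+?(?=^\d{4}-\d{2}-\d{2}|\Z)", FLAGS)
entries = [m.group(0).rstrip("\n") for m in ENTRY.finditer(sample)]

print(len(as_written), "entries with the pattern as written")
print(len(entries), "entries with the end-of-string alternative")
```

Anchoring the lookahead with ^ also keeps a date that happens to appear in the middle of a message from ending an entry early.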

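The regex below does the same thing as @Sean Breckenridge's solution, but this time it also removes the unique parts of the string that I wanted stripped. Very useful!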

[code block not preserved in this copy]
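The answer's own code is not available here, so the following is only a sketch of how stripping the timestamp and dropping duplicates might look, not the author's code. The names TIMESTAMP, strip_timestamp and unique_entries are hypothetical, the entries list is made up, and the timestamp layout is assumed from the sample log.

```python
import re

# Hypothetical entries, as produced by any of the splitting approaches.
entries = [
    "2018-03-06 11:36:42:931 SEVERE: Failed to get (com.Y.W.Z_H.ZHGC.create)",
    "2018-03-06 11:36:46:844 SEVERE: Failed to get (com.Y.W.Z_H.ZHGC.create)",
    "2018-03-06 11:36:46:824 INFO:Starting.  (com.X.s.f.o.o)",
]

# Leading "YYYY-MM-DD HH:MM:SS:mmm " timestamp (layout assumed from the sample).
TIMESTAMP = re.compile(r"^\d{4}-\d{2}-\d{2}\s\d{2}:\d{2}:\d{2}:\d{3}\s+")

def strip_timestamp(entry):
    """Remove the leading timestamp so identical messages compare equal."""
    return TIMESTAMP.sub("", entry, count=1)

def unique_entries(entries):
    """Return timestamp-free entries, keeping the first occurrence of each."""
    # dict keys keep insertion order (Python 3.7+), unlike a plain set.
    return list(dict.fromkeys(strip_timestamp(e) for e in entries))

unique = unique_entries(entries)
for entry in unique:
    print(entry)
```

Other varying fields (thread ids, request ids, and so on) could be normalised the same way with additional substitutions before the comparison.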
