如何解析包含不确定数据模式的日志文件?

2024-10-04 05:33:15 发布

您现在位置:Python中文网/ 问答频道 /正文

对于成功的请求,日志文件如下所示

o 123456789.000 10.10.10.10 3 30 10 -
n A-123456 1452830400 1 1452
n C-73652 1452830400 1 23154
n B-967845 1452830400 1  37451
n G-809573 1452830400 1 92673   
o 123456789.000 10.10.10.10 3 30 10 some_random_text_alphanumeric_jdjfdjfdfhdkjfhdkhfdhfdfdfhkdhfkdjfdkjfkdfdkfdkjnc maxbgrsdfuyhlwkjdnkshbvhsgdvsdsjdbskdhskdjoihe73njndedejdoekekdednd

o开始并有some data(如上所示)的行,以另一行以o开始结束。这就是模式。你知道吗

如果有多个这样的请求,日志文件会继续追加,如下所示

o 123456789.000 10.10.10.10 3 30 10 -
n A-123456 1452830400 1 1452
n C-73652 1452830400 1 23154
n B-967845 1452830400 1  37451
n G-809573 1452830400 1 92673   
o 123456789.000 10.10.10.10 3 30 10 some_random_text_alphanumeric_jdjfdjfdfhdkjfhdkhfdhfdfdfhkdhfkdjfdkjfkdfdkfdkjnc maxbgrsdfuyhlwkjdnkshbvhsgdvsdsjdbskdhskdjoihe73njndedejdoekekdednd
o 123456789.000 10.10.10.10 3 30 10 -
n A-123456 1452830400 1 1452
n C-73652 1452830400 1 23154
n B-967845 1452830400 1  37451
n G-809573 1452830400 1 92673   
o 123456789.000 10.10.10.10 3 30 10 some_random_text_alphanumeric_jdjfdjfdfhdkjfhdkhfdhfdfdfhkdhfkdjfdkjfkdfdkfdkjnc maxbgrsdfuyhlwkjdnkshbvhsgdvsdsjdbskdhskdjoihe73njndedejdoekekdednd

文件中可能有一些错误生成的数据,这些数据可能用于任何早期请求或最新请求。你知道吗

如果生成的错误数据不是针对最新请求,则可以忽略。

示例:

o 123456789.000 10.10.10.10 3 30 10 -
n A-123456 1452830400 1 1452
n C-73652 1452830400 1 23154
n B-967845 1452830400 1  37451
n G-809573 1452830400 1 92673   
o 123456789.000 10.10.10.10 3 30 10 some_random_text_alphanumeric_jdjfdjfdfhdkjfhdkhfdhfdfdfhkdhfkdjfdkjfkdfdkfdkjnc maxbgrsdfuyhlwkjdnkshbvhsgdvsdsjdbskdhskdjoihe73njndedejdoekekdednd
o 123456789.000 10.10.10.10 3 30 10 -
n A-123456 1452830400 1 1452
n C-73652 1452830400 1 23154
n B-967845 1452830400 1  37451
n G-809573 1452830400 1 92673   
# Should be indication of request i.e., line beginning with o, followed some data
o 123456789.000 10.10.10.10 3 30 10 -
n A-123456 1452830400 1 1452
n C-73652 1452830400 1 23154
n B-967845 1452830400 1  37451
n G-809573 1452830400 1 92673   
o 123456789.000 10.10.10.10 3 30 10 some_random_text_alphanumeric_jdjfdjfdfhdkjfhdkhfdhfdfdfhkdhfkdjfdkjfkdfdkfdkjnc maxbgrsdfuyhlwkjdnkshbvhsgdvsdsjdbskdhskdjoihe73njndedejdoekekdednd

如果为最新请求生成了错误的数据,则应捕获并突出显示该数据。

示例:

o 123456789.000 10.10.10.10 3 30 10 -
n A-123456 1452830400 1 1452
n C-73652 1452830400 1 23154
n B-967845 1452830400 1  37451
n G-809573 1452830400 1 92673   
o 123456789.000 10.10.10.10 3 30 10 some_random_text_alphanumeric_jdjfdjfdfhdkjfhdkhfdhfdfdfhkdhfkdjfdkjfkdfdkfdkjnc maxbgrsdfuyhlwkjdnkshbvhsgdvsdsjdbskdhskdjoihe73njndedejdoekekdednd
o 123456789.000 10.10.10.10 3 30 10 -
n A-123456 1452830400 1 1452
n C-73652 1452830400 1 23154
n B-967845 1452830400 1  37451
n G-809573 1452830400 1 92673   
o 123456789.000 10.10.10.10 3 30 10 some_random_text_alphanumeric_jdjfdjfdfhdkjfhdkhfdhfdfdfhkdhfkdjfdkjfkdfdkfdkjnc maxbgrsdfuyhlwkjdnkshbvhsgdvsdsjdbskdhskdjoihe73njndedejdoekekdednd
o 123456789.000 10.10.10.10 3 30 10 -
n A-123456 1452830400 1 1452
n C-73652 1452830400 1 23154
n B-967845 1452830400 1  37451
n G-809573 1452830400 1 92673   
# No line present i,e., (o 123456789.000 10.10.10.10 3 30 10 some_random_text_alphanumeric_jdjfdjfdfhdkjfhdkhfdhfdfdfhkdhfkdjfdkjfkdfdkfdkjnc maxbgrsdfuyhlwkjdnkshbvhsgdvsdsjdbskdhskdjoihe73njndedejdoekekdednd)

缺行可以是以o开头的第一行,也可以是以o开头的最后一行

我需要检查,每个请求日志都是用这种格式写的,一个文件中捕获了多少成功和不成功的请求?你知道吗

方法1:可以读取文件内容,然后像用o等行一样进行解析,我觉得这不可行

方法2:我觉得,reg ex是最佳和最好的解决方案。你知道吗

哪一个最好?你能帮我实现吗?你知道吗

迄今为止已尝试:

reg_ex1 = "o\s+\d+(\.\d+)?\d+\s+\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\s+\d+\s+\d+\s+\d+\s+-"
reg_ex2 = "o\s+\d+(\.\d+)?\d+\s+\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\s+\d+\s+\d+\s+\d+\s+[a-zA-Z0-9_]+"
with open(""some_file.log, 'r') as content_file:
content = content_file.read()
pattern1 = re.compile(reg_ex1)
begin_lines = len(pattern1.findall(content))
pattern2 = re.compile(reg_ex2)
end_lines = len(pattern2.findall(content))
if begin_lines == end_lines:
   print "File has successful requests captured"
else:
   print "File has un-successful requests captured"
   # If wrong-data generated is not for latest request, can be ignored.
   # If wrong-data generated is for latest request, it should be caught and highlighted.

May be not a good idea though, please let me know.

升级版:

o 123456789.000 10.10.10.10 3 30 10 001-
n A-123456===123 1452830400 1 14521
n C-73652 1452830400 1 231541
n B-967845 1452830400 1  37451
n G-809573==123 1452830400 1 926731
o 123456789.000 10.10.10.10 3 30 10 some_random_text_alphanumeric_jdjfdjfdfhdkjfhdkhfdhfdfdfhkdhfkdjfdkjfkdfdkfdkjnc maxbgrsdfuyhlwkjdnkshbvhsgdvsdsjdbskdhskdjoihe73njndedejdoekekdednd
o 123456789.000 10.10.10.10 3 30 10 002-
n A-123456===456 1452830400 1 14522
n C-73652 1452830400 1 231542
n B-967845 1452830400 1  37452
n G-809573===456 1452830400 1 926732
o 123456789.000 10.10.10.10 3 30 10 003-
n A-123456===789 1452830400 1 14523
n C-73652 1452830400 1 231543
n B-967845 1452830400 1  374513
n G-809573===789 1452830400 1 926733
o 123456789.000 10.10.10.10 3 30 10 some_random_text_alphanumeric_jdjfdjfdfhdkjfhdkhfdhfdfdfhkdhfkdjfdkjfkdfdkfdkjnc maxbgrsdfuyhlwkjdnkshbvhsgdvsdsjdbskdhskdjoihe73njndedejdoekekdednd

对于上面的文本,我想提取数据包1和3。你知道吗


Tags: 文件数据textdatarequest错误randomsome
3条回答
^n\s.+[\n\r]+o\s.+[\n\r]+n\s.+|^n\s.+[\n\r]+n\s.+[\n\r]+n\s.+[\n\r]+n\s.+[\n\r]+(?!o)|^o\s.+[\n\r]+o\s.+[\n\r]+o\s.+

enter image description here

我肯定会推荐方法1,为什么。。? 这样,我们就可以灵活地读取/迭代每一行。。你知道吗

with open('file.txt', r) as fp:
  line = fp.readline()
  print(type(line))  #string

  #do anything with line(string)
  #1. split_list= fp.split()   list of values separated by space
  #2. Check type of each element: 
   # split_list[0].isalpha(), 
   # split_list[0].isalpha(),
   # split_list[0].isdigit(),
   # split_list[0].isspace() like so, and then do required adding to final dict/list..

一定要尝试:抓住,每一步。。因为日志文件是不可预测的。你知道吗

为了检查文件是good还是bad,我们将使用文件的第一行和最后一行,考虑到:

  1. 如果第一行不是以o开头,则文件是错误的
  2. 如果最后一行没有以o结尾,则文件是坏的
  3. 如果第一行和最后一行以o开头,则文件是好的

你知道吗列表.txt地址:

o 123456789.000 10.10.10.10 3 30 10 -
n A-123456 1452830400 1 1452
n C-73652 1452830400 1 23154
n B-967845 1452830400 1  37451
n G-809573 1452830400 1 92673
o 123456789.000 10.10.10.10 3 30 10 some_random_text_alphanumeric_jdjfdjfdfhdkjfhdkhfdhfdfdfhkdhfkdjfdkjfkdfdkfdkjnc maxbgrsdfuyhlwkjdnkshbvhsgdvsdsjdbskdhskdjoihe73njndedejdoekekdednd
o 123456789.000 10.10.10.10 3 30 10 -
n A-123456 1452830400 1 1452
n C-73652 1452830400 1 23154
n B-967845 1452830400 1  37451
n G-809573 1452830400 1 92673
# Should be indication of request i.e., line beginning with o, followed some data
o 123456789.000 10.10.10.10 3 30 10 -
n A-123456 1452830400 1 1452
n C-73652 1452830400 1 23154
n B-967845 1452830400 1  37451
n G-809573 1452830400 1 92673
o 123456789.000 10.10.10.10 3 30 10 some_random_text_alphanumeric_jdjfdjfdfhdkjfhdkhfd

因此:

logFile = "list.txt"    
with open(logFile) as f:
    content = f.readlines()

# you may also want to remove empty lines
content = [l.strip() for l in content if l.strip()]

for line in content:
    if line.startswith("o"):  # check if the first line starts with o
        if str(content[-1]).strip("[']").split()[0] == 'o': # check if last line starts with o
            print("File is good.")
        else:
            print("File is bad.")
        break
    else:                    # end if the first line does not start with o
        print("File is bad.")
        break

编辑:

要获得有效o之间的所有响应:

你知道吗列表.txt地址:

o 123456789.000 10.10.10.10 3 30 10 001-
n A-123456 1452830400 1 14521
n C-73652 1452830400 1 231541
n B-967845 1452830400 1  37451
n G-809573 1452830400 1 926731
o 123456789.000 10.10.10.10 3 30 10 some_random_text_alphanumeric_jdjfdjfdfhdkjfhdkhfdhfdfdfhkdhfkdjfdkjfkdfdkfdkjnc maxbgrsdfuyhlwkjdnkshbvhsgdvsdsjdbskdhskdjoihe73njndedejdoekekdednd
o 123456789.000 10.10.10.10 3 30 10 002-
n A-123456 1452830400 1 14522
n C-73652 1452830400 1 231542
n B-967845 1452830400 1  37452
n G-809573 1452830400 1 926732
o 123456789.000 10.10.10.10 3 30 10 some_random_text_alphanumeric_jdjfdjfdfhdkjfhdkhfdhfdfdfhkdhfkdjfdkjfkdfdkfdkjnc maxbgrsdfuyhlwkjdnkshbvhsgdvsdsjdbskdhskdjoihe73njndedejdoekekdednd
o 123456789.000 10.10.10.10 3 30 10 003-
n A-123456 1452830400 1 14523
n C-73652 1452830400 1 231543
n B-967845 1452830400 1  374513
n G-809573 1452830400 1 926733

因此:

import re
def GetTheResponses(infile):
     with open(infile) as fp:
         red = fp.read()
         for result in re.findall('o (.*?)o ', red, re.S):
             print(result)

GetTheResponses('list.txt')

输出

123456789.000 10.10.10.10 3 30 10 001-
n A-123456 1452830400 1 14521
n C-73652 1452830400 1 231541
n B-967845 1452830400 1  37451
n G-809573 1452830400 1 926731

123456789.000 10.10.10.10 3 30 10 002-
n A-123456 1452830400 1 14522
n C-73652 1452830400 1 231542
n B-967845 1452830400 1  37452
n G-809573 1452830400 1 926732

编辑2(为了更好的可读性):

count = 1
for result in re.findall('o (.*?)o ', red, re.S):
    print("Response Packet: {}".format(count))
    print("\n".join(result.split("\n")[1:]))
    count +=1

输出

Response Packet: 1
n A-123456 1452830400 1 14521
n C-73652 1452830400 1 231541
n B-967845 1452830400 1  37451
n G-809573 1452830400 1 926731

Response Packet: 2
n A-123456 1452830400 1 14522
n C-73652 1452830400 1 231542
n B-967845 1452830400 1  37452
n G-809573 1452830400 1 926732

相关问题 更多 >