Negative lookbehind without using the OR operator in a Python regex

Posted 2024-10-03 23:18:38


I have two scenarios for extracting some information from a log file with the following structure:

proc format;

2018-04-12T07:45:52,430 INFO  [00000009] :t707982 - 26         
2018-04-12T07:45:52,430 INFO  [00000009] :t707982 - 27         
2018-04-12T07:45:52,433 INFO  [00000009] :t707982 - 35         '0010','0019'="08"
2018-04-12T07:45:52,434 INFO  [00000009] :t707982 - 36         '0005','0007','0011','0013'="09"

NOTE: There were 95219365 observations read from the data set WORK.TESTE1.
2018-04-12T07:55:41,536 INFO  [00000018] :t707982 - NOTE: The data set WORK.TESTE1 has 95219365 observations and 9 variables.
2018-04-12T07:55:41,537 INFO  [00000018] :t707982 - NOTE: Compressing data set WORK.TESTE1 decreased size by 34.04 percent. 
2018-04-12T07:55:41,538 INFO  [00000018] :t707982 -       Compressed is 92230 pages; un-compressed would require 139823 pages.
2018-04-12T07:55:42,230 INFO  [00000018] :t707982 - NOTE: PROCEDURE FORMAT used (Total process time):
2018-04-12T07:55:42,231 INFO  [00000018] :t707982 -       real time           2:07.03
2018-04-12T07:55:42,231 INFO  [00000018] :t707982 -       user cpu time       1:56.98
2018-04-12T07:55:42,231 INFO  [00000018] :t707982 -       system cpu time     39.22 seconds
2018-04-12T07:55:42,231 INFO  [00000018] :t707982 -       memory              3159502.32k

proc format;

2018-04-12T08:45:52,430 INFO  [00000009] :t707982 - 26         
2018-04-12T08:45:52,434 INFO  [00000009] :t707982 - 36         '0005','0007','0011','0013'="09"
NOTE: There were 95219365 observations read from the data set WORK.TESTE2.
2018-04-12T08:55:41,536 INFO  [00000018] :t707982 - NOTE: The data set WORK.TESTE2 has 95219365 observations and 9 variables.
2018-04-12T08:55:41,537 INFO  [00000018] :t707982 - NOTE: Compressing data set WORK.TESTE2 decreased size by 34.04 percent. 
2018-04-12T08:55:41,538 INFO  [00000018] :t707982 -       Compressed is 92230 pages; un-compressed would require 139823 pages.
2018-04-12T08:55:42,230 INFO  [00000018] :t707982 - NOTE: PROCEDURE FORMAT used (Total process time):
2018-04-12T08:55:42,231 INFO  [00000018] :t707982 -       real time           2:07.03
2018-04-12T08:55:42,231 INFO  [00000018] :t707982 -       user cpu time       1:56.98
2018-04-12T08:55:42,231 INFO  [00000018] :t707982 -       system cpu time     39.22 seconds
2018-04-12T08:55:42,231 INFO  [00000018] :t707982 -       memory              3159502.32k

1) Extract all the information between proc {format} and NOTE: PROCEDURE {format}.

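2) If the first proc {format} has no NOTE: PROCEDURE {format}, the capture needs to stop when another proc is found, and it must not return the NOTE: PROCEDURE {format} that belongs to the second proc {format}, as in this example: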

proc format;

2018-04-12T07:45:52,430 INFO  [00000009] :t707982 - 26         
2018-04-12T07:45:52,430 INFO  [00000009] :t707982 - 27         
2018-04-12T07:45:52,433 INFO  [00000009] :t707982 - 35         '0010','0019'="08"
2018-04-12T07:45:52,434 INFO  [00000009] :t707982 - 36         '0005','0007','0011','0013'="09"

NOTE: There were 95219365 observations read from the data set WORK.TESTE1.
2018-04-12T07:55:41,536 INFO  [00000018] :t707982 - NOTE: The data set WORK.TESTE1 has 95219365 observations and 9 variables.
2018-04-12T07:55:41,537 INFO  [00000018] :t707982 - NOTE: Compressing data set WORK.TESTE1 decreased size by 34.04 percent. 
2018-04-12T07:55:41,538 INFO  [00000018] :t707982 -       Compressed is 92230 pages; un-compressed would require 139823 pages.


proc format;

2018-04-12T08:45:52,430 INFO  [00000009] :t707982 - 26         
2018-04-12T08:45:52,434 INFO  [00000009] :t707982 - 36         '0005','0007','0011','0013'="09"
NOTE: There were 95219365 observations read from the data set WORK.TESTE2.
2018-04-12T08:55:41,536 INFO  [00000018] :t707982 - NOTE: The data set WORK.TESTE2 has 95219365 observations and 9 variables.
2018-04-12T08:55:41,537 INFO  [00000018] :t707982 - NOTE: Compressing data set WORK.TESTE2 decreased size by 34.04 percent. 
2018-04-12T08:55:41,538 INFO  [00000018] :t707982 -       Compressed is 92230 pages; un-compressed would require 139823 pages.
2018-04-12T08:55:42,230 INFO  [00000018] :t707982 - NOTE: PROCEDURE FORMAT used (Total process time):
2018-04-12T08:55:42,231 INFO  [00000018] :t707982 -       real time           2:07.03
2018-04-12T08:55:42,231 INFO  [00000018] :t707982 -       user cpu time       1:56.98
2018-04-12T08:55:42,231 INFO  [00000018] :t707982 -       system cpu time     39.22 seconds
2018-04-12T08:55:42,231 INFO  [00000018] :t707982 -       memory              3159502.32k

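So my problem is the second case. My regex starts at the first proc format and keeps capturing up to the NOTE: PROCEDURE FORMAT of the second one, when it should ignore the first block and capture only the second: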

(?s)(?<=proc[ ])(?P<type>\w+).*?(?:(?<=NOTE:[ ]PROCEDURE[ ])|(?<!=proc[ ]))(?P=type).*?(?=memory)

I tried adding a negative lookbehind with the OR operator, |(?<!=proc[ ]), but still had no success.

you can see my regex in action here

Can you help me?


1 Answer

Answer #1 · Posted 2024-10-03 23:18:38

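For this data structure, you don't need the inline modifier (?s), which makes the dot match newlines, to get the data between proc {format} and NOTE: PROCEDURE {format}; leaving it out prevents unnecessary backtracking.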
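As a quick illustration of what (?s) changes, here is a minimal sketch. The two-line sample string is made up for the demo and is not taken from the log above.

```python
import re

# Hypothetical two-line sample, only meant to show the effect of (?s).
text = "proc format;\nNOTE: PROCEDURE FORMAT used"

# Without (?s) the dot does not match "\n", so .* stays on the first line.
no_flag = re.search(r"proc.*", text).group(0)

# With (?s) (the inline form of re.DOTALL) the dot also matches "\n",
# so .* runs to the end of the string and has to backtrack from there.
with_flag = re.search(r"(?s)proc.*", text).group(0)

print(repr(no_flag))
print(repr(with_flag))
```

Keeping the dot confined to a single line is what lets the pattern in this answer work line by line.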

If you want the data in between, you can add a capturing group and, instead of using a positive lookbehind at the start, match proc format; directly.

To get the data in between, match every line that neither starts with proc (followed by a space) nor contains  NOTE: PROCEDURE 
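The per-line test can be tried on its own. The sketch below applies only the negative lookahead to a few lines copied or shortened from the log in the question; the name keep_line is a hypothetical helper used for illustration.

```python
import re

# Negative lookahead only: succeeds when the line neither starts with "proc "
# nor contains " NOTE: PROCEDURE ". (keep_line is an illustrative name.)
keep_line = re.compile(r"(?!proc |.* NOTE: PROCEDURE )")

lines = [
    "2018-04-12T07:45:52,430 INFO  [00000009] :t707982 - 26",
    "NOTE: There were 95219365 observations read from the data set WORK.TESTE1.",
    "2018-04-12T07:55:42,230 INFO  [00000018] :t707982 - NOTE: PROCEDURE FORMAT used (Total process time):",
    "proc format;",
]

# re.match anchors at the start of each string, like the start of each log line.
results = [bool(keep_line.match(line)) for line in lines]
print(results)  # [True, True, False, False]
```

The first two lines would be consumed as data; the last two are the boundaries where the repetition stops.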

The data in between is in capture group 2.

^proc (?P<type>\w+);\r?\n\s*((?:(?!proc |.* NOTE: PROCEDURE ).*\r?\n)*.*(?= NOTE: PROCEDURE ))

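Explanation

  • ^ Start of a line (this needs the multiline flag)
  • proc followed by a space: matched literally
  • (?P<type>\w+); Named group type, matching 1+ word characters, followed by ;
  • \r?\n\s* Match a newline and 0+ whitespace characters
  • ( Capture group 2
    • (?: Non-capturing group
      • (?!proc |.* NOTE: PROCEDURE ) Negative lookahead: assert that what is directly to the right is not proc plus a space, and that the line does not contain  NOTE: PROCEDURE 
      • .*\r?\n Match any character except a newline 0+ times, followed by a newline
    • )* Close the group and repeat it 0+ times to match all such lines
    • .*(?= NOTE: PROCEDURE ) Match any character except a newline 0+ times, asserting that what follows on the right is  NOTE: PROCEDURE 
  • ) Close group 2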
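A minimal Python sketch of using this pattern follows. It assumes re.MULTILINE so that ^ can match at every proc line, and it runs on a shortened, hand-trimmed log in the shape of the second scenario rather than the full data.

```python
import re

# The pattern from this answer; re.MULTILINE is an assumption so that ^
# matches at the start of each line, not only at the start of the string.
PATTERN = re.compile(
    r"^proc (?P<type>\w+);\r?\n\s*"
    r"((?:(?!proc |.* NOTE: PROCEDURE ).*\r?\n)*.*(?= NOTE: PROCEDURE ))",
    re.MULTILINE,
)

# Shortened sample: the first proc block has no NOTE: PROCEDURE line.
log = (
    "proc format;\n"
    "\n"
    "2018-04-12T07:45:52,430 INFO  [00000009] :t707982 - 26\n"
    "NOTE: There were 95219365 observations read from the data set WORK.TESTE1.\n"
    "\n"
    "proc format;\n"
    "\n"
    "2018-04-12T08:45:52,430 INFO  [00000009] :t707982 - 26\n"
    "NOTE: There were 95219365 observations read from the data set WORK.TESTE2.\n"
    "2018-04-12T08:55:42,230 INFO  [00000018] :t707982 - NOTE: PROCEDURE FORMAT used (Total process time):\n"
    "2018-04-12T08:55:42,231 INFO  [00000018] :t707982 -       memory              3159502.32k\n"
)

# Group 2 holds the data between "proc format;" and " NOTE: PROCEDURE ".
matches = [m.group(2) for m in PATTERN.finditer(log)]

for data in matches:
    print(data)
    print("-----")
```

On this sample only the second block is returned: the attempt at the first proc format; fails because the repetition stops at the next proc line before any NOTE: PROCEDURE is seen.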

Regex demo for the first data | Regex demo for the second data
