Python: find the part of a file with the most consecutive regex matches

Posted 2024-09-19 23:38:06


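I read a file line by line and check for the end of a code section, which is marked by a specific character sequence. That sequence can also occur inside the section, so I have to check for redundancy: the number of consecutive lines that contain the sequence. Once there are 10 consecutive occurrences, I should return the first line of that run as the detected end of the section.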

import logging
import re

regexp_dict_02 = {'Name': 'EMPTY_PAGES', 'Expr': '(FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF)'}

def FindEmptyPg(Inpfile, Section):
    NbrLine = []  # indices of the lines that match the pattern
    Pos = []
    flag = 0      # counts pairs of adjacent matching lines (not reset when a gap occurs)
    index = 0
    Ln = 0

    with open(Inpfile) as fp:
        for i, line in enumerate(fp):
            if i >= Section.startline and i < 30061:
                s = re.search(regexp_dict_02['Expr'], line)
                if s:
                    NbrLine.append(i)

        logging.info(NbrLine)
        logging.info(len(NbrLine))
        for index in range(len(NbrLine) - 1):
            if NbrLine[index + 1] - NbrLine[index] == 1:
                logging.info(str(NbrLine[index + 1]) + '  ' + str(NbrLine[index]))
                Pos.append(index)
                flag += 1
                if flag == 5:
                    Ln = NbrLine[Pos[0]]
                    break
        logging.info(Pos)
        return Ln


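In the code above I only check pairs of consecutive lines, and I get the wrong line number. I am trying to avoid anything complicated such as a state machine, but I am still stuck.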


1 answer
User
#1 · posted 2024-09-19 23:38:06

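Here is a solution. The code below iterates over every line. Each time it finds a match, it appends the line index to block. As soon as it reaches a line without a match, the block is considered "closed": the length of the block and its first index are saved in results, and a new empty block is started. Those two values are the only information you care about. Finally, results is sorted and the last item is taken. Tuples are compared element by element, so the list is ordered by the first item of each tuple, here the length of the block. The last item is therefore a tuple holding the length of the longest block found and the index of that block's first line.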

t = \
'''
000010000000000000000000000000000000000011111
00001FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF00011111
00001FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF00011111
000010000000000000000000000000000000000011111
00001FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF00011111
00001FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF00011111
00001FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF00011111
00001FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF00011111
00001FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF00011111
000010000000000000000000000000000000000011111
00001FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF00011111
00001FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF00011111
00001FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF00011111
00001FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF00011111
00001FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF00011111
00001FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF00011111
00001FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF00011111
00001FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF00011111
00001FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF00011111
00001FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF00011111
00001FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF00011111
00001FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF00011111
00001FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF00011111
00001FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF00011111
000010000000000000000000000000000000000011111
000010000000000000000000000000000000000011111
000010000000000000000000000000000000000011111
'''

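pattern = 'FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF'
block = []    # indices of the current run of consecutive matching lines
results = []  # one (length, first index) tuple per finished run
for i, line in enumerate(t.split('\n')):
    if pattern in line:
        block.append(i)
    elif block:
        # a non-matching line closes the current run: save its length and first index
        results.append((len(block), block[0]))
        block = []
if block:
    # also save a run that reaches the end of the input
    results.append((len(block), block[0]))

cons, index = sorted(results)[-1]  # longest run: (number of consecutive matches, first line index)
print(f'max consecutive matches found: {cons} , starting at line {index}')

Output:

max consecutive matches found: 14 , starting at line 11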
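
As an alternative to collecting indices by hand, the same "longest run" idea can probably be expressed with itertools.groupby from the standard library. This is only a sketch: the function name longest_run and the short sample data are made up for illustration and are not part of the answer above. Note one difference: when two runs have the same length, this version keeps the first one, whereas sorted(results)[-1] picks the last.

```python
from itertools import groupby

def longest_run(lines, pattern):
    """Return (length, first_index) of the longest run of consecutive
    lines that contain pattern, or (0, None) if no line matches.

    longest_run is a hypothetical helper name used for illustration.
    """
    best = (0, None)
    index = 0  # index of the first line of the current group
    # groupby yields maximal groups of adjacent lines with the same key
    for matched, group in groupby(lines, key=lambda line: pattern in line):
        size = sum(1 for _ in group)
        if matched and size > best[0]:
            best = (size, index)
        index += size
    return best

# Illustrative data: a run of 2 at index 1, then a run of 3 at index 4
sample = ["0000", "FFFF", "FFFF", "0000", "FFFF", "FFFF", "FFFF"]
print(longest_run(sample, "FFFF"))  # (3, 4)
```

Because groupby closes the last group when the input ends, a run that reaches the final line is counted without any extra handling.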

In response to the comment:

I need the first sufficient successive occurrences: first 10 successive occurrences matched then I catch the line.

You can use the following code instead.

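pattern = 'FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF'
block = []  # indices of the current run of consecutive matching lines
for i, line in enumerate(t.split('\n')):
    if pattern in line:
        block.append(i)
        if len(block) >= 10:
            # report as soon as the run reaches 10 lines, even if it continues to the end of the input
            print(f'found a block of at least 10 lines starting from line {block[0]}')
            break
    else:
        block = []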
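
Since the question reads from a file and uses re.search, the same early-exit idea could be wrapped in a function that accepts any iterable of lines, including an open file object. The sketch below makes some assumptions: the name first_run_start and the needed parameter are invented for illustration, and the sample data is deliberately short. It keeps only a counter and the start index of the current run, so it needs no list and no state machine.

```python
import re

def first_run_start(lines, pattern, needed=10):
    """Return the index of the first line of the first run of `needed`
    consecutive lines matching `pattern`, or None if there is no such run.

    first_run_start and needed are hypothetical names for illustration.
    """
    regex = re.compile(pattern)
    run_start = None  # index of the first line of the current run
    run_length = 0    # number of consecutive matching lines so far
    for i, line in enumerate(lines):
        if regex.search(line):
            if run_length == 0:
                run_start = i
            run_length += 1
            if run_length >= needed:
                return run_start  # stop reading as soon as the run is long enough
        else:
            run_length = 0        # a non-matching line breaks the run
    return None

# Illustrative data: a run of 3 at index 1, then a run of 5 at index 5
sample = ["00"] + ["FF"] * 3 + ["00"] + ["FF"] * 5
print(first_run_start(sample, "FF", needed=4))  # 5
```

With a real file this might be called as first_run_start(fp, regexp_dict_02['Expr']) inside a with open(Inpfile) as fp: block. The returned index is 0-based, matching enumerate, so any offset such as a section start line would have to be applied by the caller.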
