Retrieve text from PATTERN1 to PATTERN2 in Python

Posted 2024-09-27 19:29:39

I have an input file like this:

PATTERN1 PTR1 blah blah blah
needThis  blah blah blah
thisOneAsWell  blah blah blah
PATTERN2

PATTERN1 PTR2 blah blah blah
needThis  blah blah blah
thisOneAsWell  blah blah blah
PATTERN2 

............................
............................

PATTERN1  PTRN blah blah
needThis  blah blah blah
thisOneAsWell blah blah blah
PATTERN2

I need a function that returns only the first-column entries from PATTERN1 to PATTERN2, like this:

PTR1
needThis thisOneAsWell

PTR2
needThis thisOneAsWell

......................
......................
PTRN
needThis thisOneAsWell

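PTR1, PTR2, ..., PTRN are different pieces of text. PATTERN1 and PATTERN2 differ from each other, but both are always present in the file.

How can I do this in Python?

I am still a Python beginner. I tried to implement this with re.findall() but did not get the desired output: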

import re

def retrieve():
    with open("fileName", "r") as file:
        # Only finds the literal text "PATTERN1", not the lines that follow it
        string = re.findall(r"PATTERN1", file.read())
    print(string)

2 Answers

You can nest two regular expressions:

txt='''\
PATTERN1 PTR1 blah blah blah
needThis1  blah blah blah
thisOneAsWell1  blah blah blah
PATTERN2

PATTERN1 PTR2 blah blah blah
needThis2  blah blah blah
thisOneAsWell2  blah blah blah
PATTERN2 

............................
............................

PATTERN1  PTRN blah blah
needThisN  blah blah blah
thisOneAsWellN blah blah blah
PATTERN2'''

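import re

# Outer regex: capture everything after PATTERN1 up to the next line that starts with PATTERN2
for m in re.finditer(r'^PATTERN1\s*(.*?)(?=^PATTERN2)', txt, re.M | re.S):
    # Inner regex: the first word of each captured line
    print(re.findall(r'(^\w+)', m.group(1), re.M))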

Prints:

['PTR1', 'needThis1', 'thisOneAsWell1']
['PTR2', 'needThis2', 'thisOneAsWell2']
['PTRN', 'needThisN', 'thisOneAsWellN']
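
The question asks for the first entry on its own line and the remaining entries on the line below it. A minimal sketch of that formatting, reusing the two nested regexes above. The helper names `first_columns` and `format_blocks` are hypothetical, and the literal markers PATTERN1 and PATTERN2 stand in for whatever the real file uses:

```python
import re

# Hypothetical helpers; PATTERN1/PATTERN2 are the literal markers from the question.
BLOCK_RE = re.compile(r'^PATTERN1\s*(.*?)(?=^PATTERN2)', re.M | re.S)
FIRST_WORD_RE = re.compile(r'^\w+', re.M)

def first_columns(text):
    """Return one list of first-column entries per PATTERN1..PATTERN2 block."""
    return [FIRST_WORD_RE.findall(m.group(1)) for m in BLOCK_RE.finditer(text)]

def format_blocks(blocks):
    """Put the first entry on its own line and the rest on the next line."""
    return '\n\n'.join(
        block[0] + '\n' + ' '.join(block[1:]) for block in blocks if block
    )

sample = (
    "PATTERN1 PTR1 blah blah blah\n"
    "needThis  blah blah blah\n"
    "thisOneAsWell  blah blah blah\n"
    "PATTERN2\n"
    "\n"
    "PATTERN1 PTR2 blah blah blah\n"
    "needThis  blah blah blah\n"
    "thisOneAsWell  blah blah blah\n"
    "PATTERN2\n"
)

print(format_blocks(first_columns(sample)))
```

Compiling the two patterns once keeps the function bodies short; the behavior is otherwise the same as the loop above.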

Edit 1

If you are working with a file that fits comfortably in memory:

with open(fn) as f:  # fn is the path to the input file
    txt = f.read()
    for m in re.finditer(r'^PATTERN1\s*(.*?)(?=^PATTERN2)', txt, re.M | re.S):
        print(re.findall(r'(^\w+)', m.group(1), re.M))

Use mmap for larger files that do not fit comfortably in memory.
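
A rough sketch of the mmap variant, assuming the file is non-empty (mapping an empty file raises an error) and the first-column entries are ASCII. An mmap object exposes the file as bytes, so the patterns become bytes patterns; the function name `first_columns_mmap` and the temporary demo file are illustrative only:

```python
import mmap
import os
import re
import tempfile

# Bytes patterns, because an mmap object exposes the file contents as bytes.
BLOCK_RE = re.compile(rb'^PATTERN1\s*(.*?)(?=^PATTERN2)', re.M | re.S)
FIRST_WORD_RE = re.compile(rb'^\w+', re.M)

def first_columns_mmap(path):
    """Return the first-column entries of each block without calling f.read()."""
    with open(path, 'rb') as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            return [
                [w.decode('ascii') for w in FIRST_WORD_RE.findall(m.group(1))]
                for m in BLOCK_RE.finditer(mm)
            ]

# Demo on a small temporary file standing in for the real input.
demo = (
    b"PATTERN1 PTR1 blah blah blah\n"
    b"needThis  blah blah blah\n"
    b"thisOneAsWell  blah blah blah\n"
    b"PATTERN2\n"
)
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(demo)
try:
    result = first_columns_mmap(tmp.name)
finally:
    os.remove(tmp.name)
print(result)
```

The regexes are unchanged apart from the `rb` prefix; the operating system pages the file in as the search advances instead of loading it all up front.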


Edit 2

To combine the results into a single string, append each block's result to a list and join the list at the end:

with open(fn) as f:
    results = []
    txt = f.read()
    for m in re.finditer(r'^PATTERN1\s*(.*?)(?=^PATTERN2)', txt, re.M | re.S):
        results.append('\n'.join(re.findall(r'(^\w+)', m.group(1), re.M)))
    print('\n===\n'.join(results))
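
Alternatively, capture everything between the two markers with a single regex, then take the first whitespace-separated field of each line:

import re

with open('file', 'r') as f:
    content = f.read()

# DOTALL lets .*? span the line breaks between the two markers
matches = re.findall(r'PATTERN1(.*?)PATTERN2', content, re.DOTALL)

for match in matches:
    for line in match.split('\n'):
        columns = line.split()
        if columns:  # skip blank lines
            print(columns[0])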
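
The same idea can also be sketched without regular expressions, reading one line at a time so the whole file never has to be held in memory. This assumes each marker appears as the first whitespace-separated token on its line; the generator name `iter_blocks` and its `start`/`end` parameters are hypothetical:

```python
def iter_blocks(lines, start='PATTERN1', end='PATTERN2'):
    """Yield a list of first-column entries for each start..end block."""
    block = None  # None means we are currently outside a block
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if fields[0] == start:
            # The wanted entry is the token right after the start marker.
            block = fields[1:2]
        elif fields[0] == end:
            if block is not None:
                yield block
            block = None
        elif block is not None:
            block.append(fields[0])

sample_lines = [
    "PATTERN1 PTR1 blah blah blah",
    "needThis  blah blah blah",
    "thisOneAsWell  blah blah blah",
    "PATTERN2",
    "",
    "............................",
    "PATTERN1  PTRN blah blah",
    "needThis  blah blah blah",
    "thisOneAsWell blah blah blah",
    "PATTERN2",
]

for block in iter_blocks(sample_lines):
    print(block)
```

Because an open file object iterates over its lines, passing it directly as `iter_blocks(f)` inside a `with open(...) as f:` block should work the same way as the list used here.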
