Extracting blocks from a text file

Posted 2024-09-28 23:39:34


I have a text file that contains blocks in the following format:

...some lines before this...
MY TEST MATRIX (ROWS)
 0.5056E+03  0.8687E-03 -0.1202E-02 
 0.5056E+03  0.8687E-03 -0.1202E-02 
MY TEST END
 0.5056E+03  0.8687E-03 -0.1202E-02  
 0.3776E+03  0.8687E-03  0.1975E-04  
STOP
---some lines after this
MY TEST MATRIX (ROWS)
 2E+04  2E+04  0.8687E-03  
 2E+04  2E+04  0.8687E-03
MY TEST END
 0.5056E+03  0.8687E-03 -0.1202E-02 
 0.5056E+03  0.8687E-03 -0.1202E-02 
STOP
---some lines after this
---this repeats in txt file----

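There are many such blocks, and they appear at different places in the text file. I want to extract only the values that appear between MY TEST MATRIX (ROWS) and MY TEST END, and between MY TEST END and STOP, into separate arrays. Let's call them firstValue[] and secondValue[].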

For me, one block is "MY TEST MATRIX (ROWS) ... MY TEST END ... STOP".

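With the simple code shown here, I can read one block of data from the text file. However, because the blocks repeat throughout the file, I don't know how to capture the data from each block into the two arrays above.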

    import os
    import sys
    from math import *
    firstValue = []
    secondValue = []
    checkFirst = False
    checkSecond = False
    filename="r3dmdtr2.txt"
    with open(filename, "r") as infile:

        for line in infile:
            if line.strip().startswith("MY TEST MATRIX (ROWS)"):
                checkFirst = True
            if line.strip().startswith("MY TEST END"):
                checkFirst = False
                checkSecond = True
            if line.strip().startswith("STOP"):
                checkSecond = False  

            if checkFirst:
                firstValue.append(line) 

            if checkSecond:
                secondValue.append(line)          

    print(firstValue)
    print (secondValue)

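The snippet above reads a single block perfectly. How can I parse all the repeated blocks in the text file and append each one as its own array to firstValue[] and secondValue[]?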

For example:

firstValue = [[values from the first block], [values from the second block], and so on...]
secondValue = [[values from the first block], [values from the second block], and so on...]


2 Answers

Given:

$ cat file.txt
...some lines before this...
MY TEST MATRIX (ROWS)
 0.5056E+03  0.8687E-03 -0.1202E-02 
 0.5056E+03  0.8687E-03 -0.1202E-02 
MY TEST END
 0.5056E+03  0.8687E-03 -0.1202E-02  
 0.3776E+03  0.8687E-03  0.1975E-04  
STOP
 -some lines after this
MY TEST MATRIX (ROWS)
 2E+04  2E+04  0.8687E-03  
 2E+04  2E+04  0.8687E-03
MY TEST END
 0.5056E+03  0.8687E-03 -0.1202E-02 
 0.5056E+03  0.8687E-03 -0.1202E-02 
STOP
 -some lines after this
 -this repeats in txt file  

In sed, perl, or awk, you can use the concept of a range regex to do this:

$ sed -nE '/^MY TEST MATRIX/,/^MY TEST END/p' file.txt
MY TEST MATRIX (ROWS)
 0.5056E+03  0.8687E-03 -0.1202E-02 
 0.5056E+03  0.8687E-03 -0.1202E-02 
MY TEST END
MY TEST MATRIX (ROWS)
 2E+04  2E+04  0.8687E-03  
 2E+04  2E+04  0.8687E-03
MY TEST END

You can replicate this behavior in Python with a FlipFlop class:

class FlipFlop: 
    ''' Class to imitate the behavior of /start/, /end/ flip flop in awk '''
    def __init__(self, start_pattern, end_pattern):
        self.patterns = start_pattern, end_pattern
        self.state = False
    def __call__(self, st):
        ms=[e.search(st) for e in self.patterns]
        if all(m for m in ms):
            self.state = False
            return True
        rtr=True if self.state else False
        if ms[self.state]:
            self.state = not self.state
        return self.state or rtr

Then capture the blocks while reading the file line by line:

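import re

fn = 'file.txt'  # path to the sample file shown above
di = {}
# One flip-flop per range: MATRIX..END and END..STOP
blocks = [FlipFlop(re.compile(r'^MY TEST MATRIX \(ROWS\)'), re.compile(r'^MY TEST END')),
          FlipFlop(re.compile(r'^MY TEST END'), re.compile(r'^STOP'))]
for i, ff in enumerate(blocks):
    with open(fn) as f:  # the file is read once per flip-flop
        di[i] = [line.strip() for line in f if ff(line)]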

Result:

>>> di
{0: ['MY TEST MATRIX (ROWS)', 
     '0.5056E+03  0.8687E-03 -0.1202E-02', 
     '0.5056E+03  0.8687E-03 -0.1202E-02', 
     'MY TEST END', 
     'MY TEST MATRIX (ROWS)', 
     '2E+04  2E+04  0.8687E-03', 
     '2E+04  2E+04  0.8687E-03', 
     'MY TEST END'], 
 1: ['MY TEST END', 
     '0.5056E+03  0.8687E-03 -0.1202E-02', 
     '0.3776E+03  0.8687E-03  0.1975E-04', 
     'STOP', 
     'MY TEST END', 
     '0.5056E+03  0.8687E-03 -0.1202E-02', 
     '0.5056E+03  0.8687E-03 -0.1202E-02', 
     'STOP']}

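This reads the file twice, which keeps memory use low; if speed matters more, you can read the file into memory once and iterate over that instead.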
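The flip-flop result is flat and still contains the marker lines, whereas the question asked for one list per block. The following is only a minimal single-pass sketch of that grouping, not part of the original answer. The function name `collect_blocks` and the variable names are hypothetical, the sample text is copied from the question, and the code assumes the markers always appear in the order MATRIX, END, STOP and that every line between them holds only numbers.

```python
import io

def collect_blocks(lines):
    """Group the numeric rows of each block in one pass over `lines`."""
    first_values, second_values = [], []
    section = None  # None, "first" (MATRIX..END) or "second" (END..STOP)
    for raw in lines:
        line = raw.strip()
        if line.startswith("MY TEST MATRIX (ROWS)"):
            first_values.append([])    # open a new list for this block
            section = "first"
        elif line.startswith("MY TEST END"):
            second_values.append([])   # open the matching second list
            section = "second"
        elif line.startswith("STOP"):
            section = None             # ignore lines until the next block
        elif line and section == "first":
            first_values[-1].append([float(x) for x in line.split()])
        elif line and section == "second":
            second_values[-1].append([float(x) for x in line.split()])
    return first_values, second_values

# Sample data from the question; a real file object works the same way.
sample = io.StringIO("""...some lines before this...
MY TEST MATRIX (ROWS)
 0.5056E+03  0.8687E-03 -0.1202E-02
 0.5056E+03  0.8687E-03 -0.1202E-02
MY TEST END
 0.5056E+03  0.8687E-03 -0.1202E-02
 0.3776E+03  0.8687E-03  0.1975E-04
STOP
---some lines after this
MY TEST MATRIX (ROWS)
 2E+04  2E+04  0.8687E-03
 2E+04  2E+04  0.8687E-03
MY TEST END
 0.5056E+03  0.8687E-03 -0.1202E-02
 0.5056E+03  0.8687E-03 -0.1202E-02
STOP
---some lines after this
""")

first_values, second_values = collect_blocks(sample)
print(first_values)
print(second_values)
```

With a file on disk, `collect_blocks(open(fn))` inside a `with` block would read it once, line by line, so neither a second pass nor loading the whole file is needed. A non-numeric line inside a block raises `ValueError` in this sketch, which may or may not be the behavior you want.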

You can use re.findall:

>>> import re
>>> data = open('file.txt').read()
>>> blocks = re.findall(r'MY TEST MATRIX \(ROWS\)\s*(.*?)\s+MY TEST END\s*(.*?)\s+STOP', data, re.DOTALL)
>>> first, second = zip(*blocks)
>>> print (first)
('2X+00  2X+00  1X+00  \n 2X+00  2X+00  1K+00', '2P+00  2X+00  1M+00  \n 2X+00  2Z+00  1K+00')
>>> print (second)
('2Y+00  2Y+00  1E+00  \n 2Y+00  2Z+00  1E+00', '2Y+00  2Y+00  1E+00  \n 2Y+00  2Z+00  1E+00')
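Each captured group is still one raw string with embedded newlines. As a rough sketch of how those strings could be turned into the nested numeric lists the question asks for, the code below splits every group into rows of floats. The helper name `to_rows` is hypothetical, the sample text is taken from the question rather than from the output above, and every captured line is assumed to contain only numbers.

```python
import re

# Sample data from the question, inlined so the sketch is self-contained.
data = """...some lines before this...
MY TEST MATRIX (ROWS)
 0.5056E+03  0.8687E-03 -0.1202E-02
 0.5056E+03  0.8687E-03 -0.1202E-02
MY TEST END
 0.5056E+03  0.8687E-03 -0.1202E-02
 0.3776E+03  0.8687E-03  0.1975E-04
STOP
---some lines after this
MY TEST MATRIX (ROWS)
 2E+04  2E+04  0.8687E-03
 2E+04  2E+04  0.8687E-03
MY TEST END
 0.5056E+03  0.8687E-03 -0.1202E-02
 0.5056E+03  0.8687E-03 -0.1202E-02
STOP
---some lines after this
"""

# Same pattern as in the answer, compiled once.
pattern = re.compile(
    r'MY TEST MATRIX \(ROWS\)\s*(.*?)\s+MY TEST END\s*(.*?)\s+STOP',
    re.DOTALL,
)

def to_rows(text):
    """Split a captured group into rows of floats, skipping blank lines."""
    return [[float(x) for x in line.split()]
            for line in text.splitlines() if line.strip()]

blocks = pattern.findall(data)          # list of (first, second) string pairs
first = [to_rows(a) for a, _ in blocks]
second = [to_rows(b) for _, b in blocks]
print(first)
print(second)
```

Note that this approach holds the whole file in memory as one string, which is the trade-off against reading it line by line.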
