How to extract two consecutive lines that match patterns in Python

Posted 2024-09-29 23:23:50

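I am trying to extract the lines in test.txt that match two different patterns.
First I want the line that matches >> fbat -v1, and then the corresponding row of values below the p-value(2-sided) header.

This is the code I have tried, but it only extracts the lines matching the first pattern: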

import re

file = open('test.txt')
for line in file:
    match = re.findall('^>> fbat -v1', line)
    if match:
        print line

I also tried to do this in R, but R does not seem well suited to it. I am new to Python, so could someone please help me solve this? Thanks in advance.

test.txt:

>> fbat -v1 1:939467:A:G
trait STATUS; offset 0.150; model additive; test bi-allelic; minsize 2; min_freq 0.000; p 1.000; maxcmh 1000

Marker            afreq     fam#       weight     S-E(S)      Var(S)      Z        P
----------------------------------------------------------------------------------------

Weighted FBAT rare variant statistics for the SNPs:

W           Var(W)      Z           p-value(2-sided)
----------------------------------------------------
0.400       0.240       0.816       4.14216178e-01
----------------------------------------------------


>> fbat -v1 1:941298:C:T 1:941301:G:A 1:941310:C:T 1:941324:G:A
trait STATUS; offset 0.150; model additive; test bi-allelic; minsize 2; min_freq 0.000; p 1.000; maxcmh 1000

Marker            afreq     fam#       weight     S-E(S)      Var(S)      Z        P
----------------------------------------------------------------------------------------

Weighted FBAT rare variant statistics for the SNPs:

W           Var(W)      Z           p-value(2-sided)
----------------------------------------------------
0.333       0.444       0.500       6.17075077e-01
----------------------------------------------------

Desired result:

>> fbat -v1 1:939467:A:G 0.400       0.240       0.816       4.14216178e-01
>> fbat -v1 1:941298:C:T 1:941301:G:A 1:941310:C:T 1:941324:G:A 0.333       0.444       0.500       6.17075077e-01

3 Answers

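If you do not want to use regular expressions, you can use a generator, which lets you reduce RAM usage when reading large amounts of data (for example a 10 GB file).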

# Iterating over the file object reads one line at a time, so the whole
# file is never held in memory (f.readlines() would load it all at once).
with open("input.txt") as f:
    # To parse a string instead, use: content = iter(string_to_parse.splitlines())
    content = (line.rstrip("\n") for line in f)
    result = []
    try:
        for line in content:
            # Look for a line that starts with ">> fbat -v1"
            if line.startswith(">> fbat -v1"):
                result_line = line
                # Skip lines until the one that ends with p-value(2-sided)
                while not next(content).endswith("p-value(2-sided)"):
                    pass
                # Skip the dashed rule under the header
                next(content)
                # Append the row of values to the command line
                result_line += " " + next(content)
                result.append(result_line)
    # Raised when a ">> fbat -v1" line has no p-value(2-sided) block after it
    except StopIteration:
        print('Could not find "p-value(2-sided)" after ">> fbat -v1"')

# Print the result
print("\n".join(result))

I used a file to hold the data here (in case it is a log file).
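As a variation on the generator idea, the sketch below keeps a small amount of state instead of calling next() inside the loop, so a truncated record simply yields nothing rather than raising StopIteration. The function name extract_pairs and the abridged sample list are illustrative assumptions, not part of the original answer; the layout (header line, dashed rule, values row) is assumed to match test.txt.

```python
def extract_pairs(lines):
    """Yield (command, values) pairs from an iterable of FBAT output lines.

    Assumes the values row sits two lines below the line that ends with
    'p-value(2-sided)' (one dashed rule in between), as in the sample file.
    """
    command = None      # most recent '>> fbat -v1' line, if any
    countdown = None    # lines left until the values row
    for raw in lines:
        line = raw.rstrip()
        if line.startswith(">> fbat -v1"):
            command, countdown = line, None
        elif command is not None and line.endswith("p-value(2-sided)"):
            countdown = 2   # skip the dashed rule, then take the values row
        elif countdown is not None:
            countdown -= 1
            if countdown == 0:
                yield command, line
                command, countdown = None, None


# Abridged sample taken from test.txt (hypothetical in-memory stand-in for the file)
sample = [
    ">> fbat -v1 1:939467:A:G\n",
    "Weighted FBAT rare variant statistics for the SNPs:\n",
    "\n",
    "W           Var(W)      Z           p-value(2-sided)\n",
    "----------------------------------------------------\n",
    "0.400       0.240       0.816       4.14216178e-01\n",
    "----------------------------------------------------\n",
]

pairs = list(extract_pairs(sample))
for command, values in pairs:
    print(command, values)
```

Because extract_pairs accepts any iterable of lines, an open file object can be passed in directly (for example extract_pairs(f) inside a with open("test.txt") as f: block), which reads the file one line at a time.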

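You can do this with a regular expression that extracts the required data across multiple lines. With only two samples it is hard to know whether this one matches every case: some of your data may be less regular than the samples show.

This does not follow the line-by-line for line in file: pattern, because each record in your data spans a bundle of lines.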

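import re

with open('test.txt') as file:
    data = file.read()
# Group 1: the ">> fbat -v1 ..." line; group 2: the row of values under the dashed rule
rex = re.compile(r"(>> fbat -v1.+?\n).+?p-value\(2-sided\)\n-+\n(.+?)\n-", re.DOTALL)
for header, numbers in rex.findall(data):
    print(header.rstrip(), numbers)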

Output:

>> fbat -v1 1:939467:A:G 0.400       0.240       0.816       4.14216178e-01
>> fbat -v1 1:941298:C:T 1:941301:G:A 1:941310:C:T 1:941324:G:A 0.333       0.444       0.500       6.17075077e-01

By the way, I notice you are using Python 2. Unless this is a one-off, please consider switching to Python 3. You should not spend time learning Python 2.
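If the numbers are needed as values rather than text, the same multi-line regular-expression idea can be extended with named groups and a float conversion. This is only a sketch: the names RECORD and parse_records, the dictionary keys, and the abridged sample string are assumptions for illustration, and the code assumes every values row holds exactly four numbers with no trailing whitespace on the header line. As with the pattern above, a record that lacks a p-value(2-sided) block would be matched against the next record's statistics.

```python
import re

RECORD = re.compile(
    r"^(?P<command>>> fbat -v1[^\n]*)\n"   # the command line
    r".*?p-value\(2-sided\)\n"             # lazily skip to the statistics header
    r"-+\n"                                # dashed rule under the header
    r"(?P<values>[^\n]+)",                 # the row of numbers
    re.DOTALL | re.MULTILINE,
)


def parse_records(text):
    """Return one dict per record, with the four statistics converted to float."""
    records = []
    for m in RECORD.finditer(text):
        # Raises ValueError if the row does not contain exactly four numbers
        w, var_w, z, p = (float(x) for x in m.group("values").split())
        records.append(
            {"command": m.group("command"), "W": w, "Var(W)": var_w, "Z": z, "p": p}
        )
    return records


# Abridged sample taken from test.txt
text = (
    ">> fbat -v1 1:939467:A:G\n"
    "Weighted FBAT rare variant statistics for the SNPs:\n"
    "\n"
    "W           Var(W)      Z           p-value(2-sided)\n"
    "----------------------------------------------------\n"
    "0.400       0.240       0.816       4.14216178e-01\n"
    "----------------------------------------------------\n"
)

records = parse_records(text)
for record in records:
    print(record["command"], record["p"])
```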

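import re

# A file object cannot be indexed, so read the lines into a list first
with open('test.txt') as file:
    lines = file.read().splitlines()

for idx, line in enumerate(lines):
    if re.match(r'>> fbat -v1', line):
        # Scan forward for the p-value(2-sided) header (parentheses escaped);
        # the values sit two lines below it, after the dashed rule
        for j in range(idx + 1, len(lines) - 2):
            if re.search(r'p-value\(2-sided\)', lines[j]):
                print(line, lines[j + 2])
                break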

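Of course, you need to handle the end of the file: if the last line matches ^>> fbat -v1, any lookup past it must be guarded, otherwise you would try to access a next line that does not exist.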
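One way to guard that edge case is to pull the following line from an iterator with a default value, so that a match on the last line produces None instead of an error. The helper name line_after_each_match and the three-line sample are hypothetical, chosen only to illustrate the guard.

```python
def line_after_each_match(lines, prefix):
    """Yield (matching_line, following_line) for each line starting with prefix.

    following_line is None when the match is the last line. The following
    line is consumed, so it is not itself tested against the prefix.
    """
    it = iter(lines)
    for line in it:
        if line.startswith(prefix):
            following = next(it, None)  # default avoids StopIteration at end of input
            yield line.rstrip(), None if following is None else following.rstrip()


# Hypothetical input whose last line matches but has nothing after it
sample = [
    ">> fbat -v1 1:939467:A:G\n",
    "trait STATUS; offset 0.150\n",
    ">> fbat -v1 1:941298:C:T\n",
]

matches = list(line_after_each_match(sample, ">> fbat -v1"))
for command, following in matches:
    print(command, "|", following)
```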
