How to extract two consecutive lines that match a pattern in Python

>> fbat -v1 1:939467:A:G
trait STATUS; offset 0.150; model additive; test bi-allelic; minsize 2; min_freq 0.000; p 1.000; maxcmh 1000
Marker afreq fam# weight S-E(S) Var(S) Z P
----------------------------------------------------------------------------------------
Weighted FBAT rare variant statistics for the SNPs:
W Var(W) Z p-value(2-sided)
----------------------------------------------------
0.400 0.240 0.816 4.14216178e-01
----------------------------------------------------
>> fbat -v1 1:941298:C:T 1:941301:G:A 1:941310:C:T 1:941324:G:A
trait STATUS; offset 0.150; model additive; test bi-allelic; minsize 2; min_freq 0.000; p 1.000; maxcmh 1000
Marker afreq fam# weight S-E(S) Var(S) Z P
----------------------------------------------------------------------------------------
Weighted FBAT rare variant statistics for the SNPs:
W Var(W) Z p-value(2-sided)
----------------------------------------------------
0.333 0.444 0.500 6.17075077e-01
----------------------------------------------------

3 answers

User

Answer 1 · edited 2024-09-29 23:23:50

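If you would rather not use a regular expression, you can use a generator, which keeps RAM usage low when reading large amounts of data (for example a 10 GB file).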

result = []
with open("input.txt") as f:
    # Generator expression: lines are read lazily, one at a time, so the
    # whole file is never held in memory. To parse a string instead,
    # iterate over iter(string_to_parse.splitlines()).
    content = (line.rstrip("\n") for line in f)
    try:
        for line in content:
            # Look for a line that starts with ">> fbat -v1"
            if line.startswith(">> fbat -v1"):
                result_line = line
                # Skip lines until the one that ends with "p-value(2-sided)"
                while not next(content).endswith("p-value(2-sided)"):
                    pass
                # Skip the dashed separator line
                next(content)
                # Append the line that holds the numbers
                result_line += " " + next(content)
                # Finally, store the combined line
                result.append(result_line)
    # Raised when a ">> fbat -v1" line has no "p-value(2-sided)" line after it
    except StopIteration:
        print('Could not find "p-value(2-sided)" after ">> fbat -v1"')

# Print the result
print("\n".join(result))

Here I read the data from a file (as you would with a log file).
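The same skipping logic can be wrapped in a generator function so that it works on any iterable of lines (an open file, a list, or the result of splitlines()). This is only a sketch: the name iter_fbat_results and the shortened SAMPLE text are made up for illustration, and it assumes every block has the layout shown in the question (a "p-value(2-sided)" header, one dashed line, then the numbers).

```python
def iter_fbat_results(lines):
    """Yield (header, numbers) pairs from an iterable of lines.

    Hypothetical helper; assumes the block layout shown in the question.
    """
    it = (line.rstrip("\n") for line in lines)
    for line in it:
        if not line.startswith(">> fbat -v1"):
            continue
        header = line
        # Advance the shared iterator to the column-header line
        for candidate in it:
            if candidate.rstrip().endswith("p-value(2-sided)"):
                break
        else:
            return  # input ended before a "p-value(2-sided)" line was found
        next(it, None)            # skip the dashed separator
        numbers = next(it, None)  # the line with the four values
        if numbers is None:
            return
        yield header, numbers.strip()


# Shortened sample, for illustration only
SAMPLE = """\
>> fbat -v1 1:939467:A:G
trait STATUS; offset 0.150; model additive
Weighted FBAT rare variant statistics for the SNPs:
W Var(W) Z p-value(2-sided)
----------------------------------------------------
0.400 0.240 0.816 4.14216178e-01
----------------------------------------------------
"""

results = list(iter_fbat_results(SAMPLE.splitlines()))
for header, numbers in results:
    print(header, numbers)
```

Passing an open file object instead of SAMPLE.splitlines() keeps the memory benefit, because the lines are consumed one at a time. Note that a block without a "p-value(2-sided)" line would cause the following block's header to be skipped, so this is only safe for regular input.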

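User

Answer 2 · edited 2024-09-29 23:23:50

You can do this with a regular expression that pulls the required data out of several lines at once. With only two samples it is hard to know whether this one matches every case: some of your data may not be as regular as the samples suggest.

This does not follow the usual line-by-line "for line in file:" pattern, because your data comes in bundles of lines.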

import re

with open('test.txt') as file:
    data = file.read()
# DOTALL lets "." match newlines, so one match can span a whole bundle of lines
rex = re.compile(r"(>> fbat -v1.+?\n).+?p-value\(2-sided\)\n-+\n(.+?)\n-", re.DOTALL)
for header, numbers in rex.findall(data):
    print(header.rstrip(), numbers)

The output is

>> fbat -v1 1:939467:A:G 0.400       0.240       0.816       4.14216178e-01
>> fbat -v1 1:941298:C:T 1:941301:G:A 1:941310:C:T 1:941324:G:A 0.333       0.444       0.500       6.17075077e-01

Incidentally, I notice you are using Python 2. Unless this is a one-off, please consider switching to Python 3. You should not spend time learning Python 2.
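If the numbers are needed as values rather than as text, the multi-line regular expression approach can be combined with re.finditer and float(). The following is a minimal sketch, not a tested parser for real FBAT output: the names BLOCK_RE, parse_fbat and SAMPLE are hypothetical, the sample is shortened, and the code assumes that the numbers line always holds exactly four values (W, Var(W), Z, p-value).

```python
import re

# Assumed layout: header line, any lines, column headers, dashes, numbers
BLOCK_RE = re.compile(
    r"^(>> fbat -v1[^\n]*)\n"          # group 1: the command line
    r".*?p-value\(2-sided\)[ \t]*\n"   # skip ahead to the column headers
    r"-+[ \t]*\n"                      # dashed separator
    r"([^\n]+)",                       # group 2: the line with the values
    re.DOTALL | re.MULTILINE,
)


def parse_fbat(text):
    """Return one dict per block (hypothetical helper)."""
    rows = []
    for m in BLOCK_RE.finditer(text):
        # Assumes exactly four whitespace-separated numbers
        w, var_w, z, p = (float(x) for x in m.group(2).split())
        rows.append({
            "snps": m.group(1).split()[3:],  # tokens after ">> fbat -v1"
            "W": w, "Var(W)": var_w, "Z": z, "p": p,
        })
    return rows


# Shortened sample, for illustration only
SAMPLE = """\
>> fbat -v1 1:939467:A:G
trait STATUS; offset 0.150
W Var(W) Z p-value(2-sided)
----------------------------------------------------
0.400 0.240 0.816 4.14216178e-01
----------------------------------------------------
>> fbat -v1 1:941298:C:T 1:941301:G:A
trait STATUS; offset 0.150
W Var(W) Z p-value(2-sided)
----------------------------------------------------
0.333 0.444 0.500 6.17075077e-01
----------------------------------------------------
"""

rows = parse_fbat(SAMPLE)
for row in rows:
    print(row["snps"], row["p"])
```

Allowing optional trailing spaces or tabs before each newline makes the pattern slightly more tolerant than an exact "\n" match, but irregular blocks can still be missed or merged, as the answer above warns.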

User

Answer 3 · edited 2024-09-29 23:23:50

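import re

with open('test.txt') as file:
    # A file object cannot be indexed, so read the lines into a list first
    lines = file.readlines()

for idx, line in enumerate(lines):
    match = re.findall(r'^>> fbat -v1', line)
    if match:
        # Parentheses are regex metacharacters; escape them to match literally
        match = re.findall(r'p-value\(2-sided\)', lines[idx + 1])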

Of course, you need to handle the last line: if it matches ^>> fbat -v1, you would try to access a next line that does not exist.
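One way to sidestep the last-line problem is to pair every line with its successor using zip, which stops as soon as the shorter input runs out. The sketch below is an illustration only: consecutive_matches is a hypothetical helper, and the two patterns are examples chosen to fit the sample in the question.

```python
import re


def consecutive_matches(lines, first_pattern, second_pattern):
    """Return (line, next_line) pairs where both lines match their pattern.

    Hypothetical helper; `lines` must be a list (it is sliced).
    """
    first = re.compile(first_pattern)
    second = re.compile(second_pattern)
    # zip(lines, lines[1:]) never pairs the last line with a missing
    # successor, so no IndexError can occur
    return [
        (a, b)
        for a, b in zip(lines, lines[1:])
        if first.search(a) and second.search(b)
    ]


lines = [
    "W Var(W) Z p-value(2-sided)",
    "----------------------------------------------------",
    "0.400 0.240 0.816 4.14216178e-01",
    ">> fbat -v1 1:939467:A:G",
]

# Column-header line immediately followed by a dashed line
pairs = consecutive_matches(lines, r"p-value\(2-sided\)$", r"^-+$")

# The final line matches the first pattern but has no successor
tail = consecutive_matches(lines, r"^>> fbat -v1", r".")
```

Here pairs holds the header line together with the dashed line after it, and tail is empty because the last line is never paired with anything.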
