How do I select on the last character of the header in a FASTA file?

fasta = open('x.fasta')
output = open('x1.fasta', 'w')
seq = ''
for line in fasta:
    if line[0] == '>' and seq == '':
        header = line
    elif line[0] != '>':
        seq = seq + line
    for n in header:
        n = header[-1]
        if '1' in n:
            output.write(header + seq)
            header = line
            seq = ''
if "1" in header:
    output.write(header + seq)
output.close()

3 answers

User

#1 · Edited 2024-09-30 22:19:42

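You can start by building a list of individual records. Records are separated by '>', and splitting each one once on the first newline with .split('\n', 1) gives you the header and the body separately.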

records = [
    line.split('\n', 1)
    for line in fasta.read().split('>')[1:]
]

Then you can simply keep only the records whose header ends in 1 and write those out.

for header, body in records:
    if header.endswith('1'):
        output.write('>' + header + '\n')
        output.write(body)
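
Putting the two snippets together, a self-contained version might look like the sketch below. The function name select_records and the sample text are made up for illustration, and it works on a string rather than on the open files used above. It uses str.partition instead of split('\n', 1) so that a header with no sequence lines does not raise an error when unpacking.

```python
def select_records(text, suffix="1"):
    """Return the FASTA records (as strings) whose header ends with `suffix`."""
    selected = []
    # Anything before the first '>' is not a record, hence the [1:].
    for chunk in text.split(">")[1:]:
        # partition never fails, even if the header is the only line.
        header, _, body = chunk.partition("\n")
        # rstrip() ignores a trailing '\r' left by Windows line endings.
        if header.rstrip().endswith(suffix):
            selected.append(">" + header + "\n" + body)
    return selected


# Invented sample data, only for demonstration.
sample = ">id1-apple1\nACGT\nTTGA\n>id2-pear2\nGGCC\n>id3-plum1\nAAAT\n"
print("".join(select_records(sample)), end="")
```

Like the original snippet, this reads the whole file into memory, so it is best suited to files of moderate size.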

User

#2 · Edited 2024-09-30 22:19:42

One option is to read the whole file into a string and then use re.findall with the following regular expression pattern:

>[A-Z0-9]+-\w+1\r?\n[ACGT]+

Example script:

import re

with open('x.fasta') as fasta:
    text = fasta.read()
matches = re.findall(r'>[A-Z0-9]+-\w+1\r?\n[ACGT]+', text)
print(matches)

For the sample data you provided, this prints:

['>XP1987651-apple1\nACCTTCCAAGTAG', '>XP1254115-pear1\nATGCCGTAGTCAA']
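
Note that the pattern above is tied to the header layout in the sample data (an uppercase ID, a hyphen, then a name) and only captures the first line of each sequence. If your headers vary or sequences span several lines, a looser pattern with re.MULTILINE is one possibility. This is only a sketch under those assumptions; the pattern name RECORD and the sample text are invented here.

```python
import re

# '^>.*1\r?\n'        a header line that ends in '1'
# '(?:(?!>).+\n?)*'   any following non-empty lines that do not start with '>'
RECORD = re.compile(r'^>.*1\r?\n(?:(?!>).+\n?)*', re.MULTILINE)

# Invented sample data, only for demonstration.
sample = ">id1-apple1\nACGT\nTTGA\n>id2-pear2\nGGCC\n>id3-plum1\nAAAT\n"
matches = RECORD.findall(sample)
print(matches)
```

Because the pattern has no capturing groups, re.findall returns each matched record as a whole string.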

User

#3 · Edited 2024-09-30 22:19:42

You can do this quite simply by setting a flag whenever you see a header line, recording whether it matches.

with open('x.fasta') as fasta, open('x1.fasta', 'w') as output:
    for line in fasta:
        if line.startswith('>'):
            select = line.endswith('1\n')
        if select:
            output.write(line)

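This avoids reading the whole file into memory; only one line is examined at a time.

You may notice that line still contains the newline at the end. I chose to simply keep it; sometimes things get easier if you trim it with line = line.rstrip('\n') and add it back on output where necessary.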
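
For what it's worth, here is a sketch of that trimmed variant. The generator name filter_lines and the sample lines are made up for illustration. Stripping the line ending first also copes with Windows line endings and with a final header that has no trailing newline, and initializing the flag covers input that does not start with a header.

```python
def filter_lines(lines, suffix="1"):
    """Yield the lines of every record whose header ends with `suffix`."""
    select = False  # set up front in case the input does not start with '>'
    for line in lines:
        line = line.rstrip("\r\n")  # drop '\n' or '\r\n' if present
        if line.startswith(">"):
            select = line.endswith(suffix)
        if select:
            yield line + "\n"  # add the newline back on output


# Invented sample data, only for demonstration.
sample = [">id1-apple1\n", "ACGT\n", ">id2-pear2\r\n", "GGCC\r\n", ">id3-plum1"]
print("".join(filter_lines(sample)), end="")
```

With real files you could pass the open input file straight in, for example output.writelines(filter_lines(fasta)), which still processes one line at a time.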
