Parsing an odd file into a text file

Posted 2024-09-30 01:32:22


I have a file like this small example:

>ENSG00000003249|ENST00000002501|DBNDD1|2079
GCCGCGGCCCCCCGGTTGCTGCCCCGATGCGCTGCGCCCGGAGCCGGGGCCGAGTCGCTG
CCGCAGCTGTTGGGGCGCCCGGGCCAGGCGACGCCGCCGTCGCCCGTGCCCCTCCCAGAC
CGCACCGGCCGC
>ENSG00000048028|ENST00000003302|USP28|4669
AGTCCTGAGAGGCTGGGCCGGCGGCGGCTGCGGCGGGAGACCGGTGACCCGCGGCTGGGC
GCCTCGGCC

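Each line starting with ">" has 4 parts separated by "|", and the lines that follow it hold the character sequence belonging to that ">" line. I want to parse this file into a text file with 5 columns. The first 4 columns come from the line starting with ">", and the fifth column is the sequence. For example, for the last sequence the result would be: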

ENSG00000048028 ENST00000003302 USP28 4669 AGTCCTGAGAGGCTGGGCCGGCGGCGGCTGCGGCGGGAGACCGGTGACCCGCGGCTGGGCGCCTCGGCC

I wrote this code, but it does not work:

list = []
with open(inputfile) as f:
    for line in f:
        if line.startswith('>'):
            parts = line.split('|')
        else:
            parts = sequence
        list.append(parts)

infile = open('test.txt', 'w')
for item in list:
  infile.write("%s\n" % item)

1 answer
User
#1 · Posted 2024-09-30 01:32:22

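This is the FASTA file format. If you want to parse it by hand, store the header line for later use. Note that a sequence can be broken across several lines; only write out the combined columns when you reach the end of the file or a new header.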

I would use the csv module to write the output:

import csv

with open(inputfile) as f, open('test.txt', 'w', newline='') as outfile:
    header = sequence = None
    out = csv.writer(outfile, delimiter='|')
    for line in f:
        if line.startswith('>'):  # header
            # write out previous data
            if header:
                entry = header + [''.join(sequence)]
                out.writerow(entry)
            header = line.strip('>\n').split('|')
            sequence = []
        else:
            sequence.append(line.strip())

    if header:
        entry = header + [''.join(sequence)]
        out.writerow(entry)
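
The code above writes columns separated by `|`, while the example output in the question shows them separated by spaces. A minimal sketch of switching the separator, assuming a single space is what is wanted; the row is copied from the question, and `lineterminator='\n'` is only my choice for this illustration:

```python
import csv
from io import StringIO

# One parsed record: the four header fields plus the joined sequence
# (values copied from the question's last record).
row = [
    "ENSG00000048028",
    "ENST00000003302",
    "USP28",
    "4669",
    "AGTCCTGAGAGGCTGGGCCGGCGGCGGCTGCGGCGGGAGACCGGTGACCCGCGGCTGGGCGCCTCGGCC",
]

buf = StringIO()  # stands in for the real output file
# Assumption: a single space as the column separator, as in the question.
out = csv.writer(buf, delimiter=' ', lineterminator='\n')
out.writerow(row)

print(buf.getvalue(), end='')
```

None of the fields contain a space, so the csv module has no reason to quote anything here. For a tab-separated file, `delimiter='\t'` would work the same way.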

Demo of the csv code above, using in-memory files:

>>> from io import StringIO
>>> import csv
>>> demoinput = StringIO('''\
... >ENSG00000003249|ENST00000002501|DBNDD1|2079
... GCCGCGGCCCCCCGGTTGCTGCCCCGATGCGCTGCGCCCGGAGCCGGGGCCGAGTCGCTG
... CCGCAGCTGTTGGGGCGCCCGGGCCAGGCGACGCCGCCGTCGCCCGTGCCCCTCCCAGAC
... CGCACCGGCCGC
... >ENSG00000048028|ENST00000003302|USP28|4669
... AGTCCTGAGAGGCTGGGCCGGCGGCGGCTGCGGCGGGAGACCGGTGACCCGCGGCTGGGC
... GCCTCGGCC
... ''')
>>> outfile = StringIO()
>>> f = demoinput
>>> header = sequence = None
>>> out = csv.writer(outfile, delimiter='|')
>>> for line in f:
...     if line.startswith('>'):  # header
...         # write out previous data
...         if header:
...             entry = header + [''.join(sequence)]
...             out.writerow(entry)
...         header = line.strip('>\n').split('|')
...         sequence = []
...     else:
...         sequence.append(line.strip())
...
178
>>> if header:
...     entry = header + [''.join(sequence)]
...     out.writerow(entry)
...
114
>>> print(outfile.getvalue())
ENSG00000003249|ENST00000002501|DBNDD1|2079|GCCGCGGCCCCCCGGTTGCTGCCCCGATGCGCTGCGCCCGGAGCCGGGGCCGAGTCGCTGCCGCAGCTGTTGGGGCGCCCGGGCCAGGCGACGCCGCCGTCGCCCGTGCCCCTCCCAGACCGCACCGGCCGC
ENSG00000048028|ENST00000003302|USP28|4669|AGTCCTGAGAGGCTGGGCCGGCGGCGGCTGCGGCGGGAGACCGGTGACCCGCGGCTGGGCGCCTCGGCC
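
If the parsing is needed in more than one place, the same "remember the header, flush on the next header or at the end" logic can be wrapped in a generator. This is only a sketch: `iter_fasta_records` is a name I made up, and skipping blank lines and any text before the first header is my assumption, not something the code above does.

```python
from io import StringIO


def iter_fasta_records(lines):
    """Yield [header fields..., sequence] for each record (hypothetical helper)."""
    header = None
    chunks = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # assumption: blank lines carry no data
        if line.startswith('>'):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header + [''.join(chunks)]
            header = line[1:].split('|')
            chunks = []
        elif header is not None:
            # Sequence line; lines before the first header are ignored.
            chunks.append(line)
    # Flush the last record once the input is exhausted.
    if header is not None:
        yield header + [''.join(chunks)]


sample = StringIO(
    ">ENSG00000003249|ENST00000002501|DBNDD1|2079\n"
    "GCCGCGGCCCCCCGGTTGCTGCCCCGATGCGCTGCGCCCGGAGCCGGGGCCGAGTCGCTG\n"
    "CCGCAGCTGTTGGGGCGCCCGGGCCAGGCGACGCCGCCGTCGCCCGTGCCCCTCCCAGAC\n"
    "CGCACCGGCCGC\n"
    ">ENSG00000048028|ENST00000003302|USP28|4669\n"
    "AGTCCTGAGAGGCTGGGCCGGCGGCGGCTGCGGCGGGAGACCGGTGACCCGCGGCTGGGC\n"
    "GCCTCGGCC\n"
)
records = list(iter_fasta_records(sample))
for record in records:
    print(record[:4], len(record[4]))
```

Each yielded list can be passed straight to `csv.writer.writerow`, so the writing side stays as it is in the answer.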
