Extract sequences from a FASTA file into multiple files, based on header IDs listed in a separate file

#!/usr/bin/env python
import sys
from Bio import SeqIO

input_file = sys.argv[1]
id_file = sys.argv[2]
output_file = sys.argv[3]

# Take the first whitespace-delimited token of each non-blank line as a wanted ID
wanted = set(line.split(None, 1)[0] for line in open(id_file) if line.strip())
print("Found %i unique identifiers in %s" % (len(wanted), id_file))

# Index the FASTA file so records can be looked up by ID without loading everything
index = SeqIO.index(input_file, "fasta")
records = (index[r] for r in wanted)
count = SeqIO.write(records, output_file, "fasta")
assert count == len(wanted)
print("Saved %i records from %s to %s" % (count, input_file, output_file))

2 Answers

User

Answer 1 · edited 2024-10-01 09:30:30

A few brief suggestions:

If all headers follow the same pattern, you can extract the distinguishing element:

record.description.split("_")[1]

(this gives "2040" from "CAP357_2040_011wpi_v1v3_1_008_00006_001.1")
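As a minimal sketch of that extraction step, the helper below wraps the `split` call and reports headers that do not follow the pattern. The function name `header_field` and its parameters are hypothetical, chosen only for illustration; the example header is the one quoted above.

```python
def header_field(description, delimiter="_", index=1):
    """Return one delimiter-separated field of a FASTA header.

    Raises ValueError when the header has too few fields, which is easier
    to diagnose than the bare IndexError from split(...)[index].
    """
    fields = description.split(delimiter)
    if not -len(fields) <= index < len(fields):
        raise ValueError(
            "header %r has only %d field(s)" % (description, len(fields))
        )
    return fields[index]


# Example with the header quoted in the answer
print(header_field("CAP357_2040_011wpi_v1v3_1_008_00006_001.1"))  # 2040
```

In a real script the argument would typically be `record.description` for each record yielded by `SeqIO.parse`.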

If you use a dict, you can gather the records into collections keyed by that element:

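collected = {}
for record in records:
    descr = record.description.split("_")[1]
    try:
        collected[descr].append(record)
    except KeyError:
        collected[descr] = [record]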
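The collection step described here can also be sketched with `collections.defaultdict`, which removes the need to handle a missing key by hand. This is only an illustration: `group_by_field` and `FakeRecord` are hypothetical names, `FakeRecord` stands in for a Biopython record so the example runs without a FASTA file, and the second and third headers are made up.

```python
from collections import defaultdict, namedtuple


def group_by_field(records, delimiter="_", index=1):
    """Group records by one delimiter-separated field of record.description."""
    collected = defaultdict(list)
    for record in records:
        key = record.description.split(delimiter)[index]
        collected[key].append(record)
    return dict(collected)


# Hypothetical stand-in for a Biopython SeqRecord (only .description is used)
FakeRecord = namedtuple("FakeRecord", ["description", "seq"])

demo = [
    FakeRecord("CAP357_2040_011wpi_v1v3_1_008_00006_001.1", "ACGT"),
    FakeRecord("CAP357_2040_011wpi_v1v3_1_008_00007_001.1", "ACGA"),  # made-up header
    FakeRecord("CAP357_3000_011wpi_v1v3_1_008_00008_001.1", "TTGA"),  # made-up header
]

groups = group_by_field(demo)
print(sorted(groups))        # ['2040', '3000']
print(len(groups["2040"]))   # 2
```

With Biopython installed, the same function should accept the iterator returned by `SeqIO.parse(path, "fasta")` in place of `demo`.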

You can then write each collection to a new file:

import os

file_path = os.getcwd()  # write the output files to the current working directory
file_name = "outfile%s"
for (descr, records) in collected.items():   # iteritems in python2
    with open(os.path.join(file_path, file_name % descr), 'w') as f:
        SeqIO.write(records, f, 'fasta')

User

Answer 2 · edited 2024-10-01 09:30:30

For completeness, here is the "final" script:

#!/usr/bin/env python
# a script to extract fasta records from a fasta file to multiple separate fasta files based on a particular ID (time point) in a particular field, for a given delimiter
# to run, navigate to file location with command prompt and enter: python split_fasta_by_collections.py infile.fasta
from Bio import SeqIO
import os
import sys

records = SeqIO.parse(sys.argv[1], "fasta")
collected = {}
for record in records:
    descr = record.description.split("_")[1]  # "_" sets the delimiter; 1 selects the second field (counting starts at 0)
    try:
        collected[descr].append(record)
    except KeyError:
        collected[descr] = [record]

file_name = "outfile%s.fasta" 
file_path = os.getcwd() #sets the output file path to your current working directory

for (descr, records) in collected.items():
    with open(os.path.join(file_path, file_name % descr), 'w') as f:
        SeqIO.write(records, f, 'fasta')

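I am new to Python and am trying to find a way to do the following:

The header ID file looks like this:

An example of my FASTA file looks like this:

Expected output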
