<p>This is copied from here: <a href="https://bioinformatics.stackexchange.com/questions/13708/remove-redundant-sequences-from-fasta-file-in-python">Remove Redundant Sequences from FASTA file in Python</a></p>
<p>It uses Biopython and works with FASTA files whose header lines are of the <code>&gt;</code> type (see the <a href="https://en.wikipedia.org/wiki/FASTA_format" rel="nofollow noreferrer">FASTA format wiki</a>):</p>
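<pre><code>from Bio import SeqIO
import time

start = time.time()
seen = []     # sequences already encountered
records = []  # first record found for each distinct sequence

for record in SeqIO.parse("INPUT-FILE", "fasta"):
    # Compare by sequence string, so records with different headers
    # but identical sequences count as duplicates.
    if str(record.seq) not in seen:
        seen.append(str(record.seq))
        records.append(record)

# Write the unique records to a FASTA file
SeqIO.write(records, "OUTPUT-FILE", "fasta")
end = time.time()
print(f"Run time is {(end - start) / 60} minutes")
</code></pre>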
<p>As MattMDo suggested, using a set instead of a list is faster:</p>
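<pre><code>seen = set()  # set membership tests are O(1) on average
records = []
for record in SeqIO.parse("b4r2.fasta", "fasta"):
    # str(...) keeps the comparison on sequence content, as in the first version
    seq = str(record.seq)
    if seq not in seen:
        seen.add(seq)
        records.append(record)
</code></pre>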
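<p>To show the set-based idea on its own, here is a minimal sketch that does not need Biopython. It assumes records are plain <code>(header, sequence)</code> tuples, and <code>unique_by_sequence</code> is a hypothetical helper name, not part of any library. Written as a generator, it keeps the first record for each distinct sequence and preserves input order:</p>

```python
from typing import Iterable, Iterator, Tuple

# Hypothetical record shape for illustration: (header, sequence)
Record = Tuple[str, str]


def unique_by_sequence(records: Iterable[Record]) -> Iterator[Record]:
    """Yield the first record seen for each distinct sequence, in input order."""
    seen = set()  # sequences already yielded; lookups are O(1) on average
    for header, sequence in records:
        if sequence not in seen:
            seen.add(sequence)
            yield header, sequence


# Made-up example data: seq3 repeats the sequence of seq1
records = [("seq1", "ACGT"), ("seq2", "TTGA"), ("seq3", "ACGT")]
unique = list(unique_by_sequence(records))
print(unique)
```

<p>With the list version, every <code>in</code> check scans all sequences seen so far, so the total work grows roughly quadratically with the number of records. The set version stays close to linear.</p>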
<p>I have a longer version that uses argparse, but it is slower because it also counts the sequences. I can post it if needed.</p>