How can I speed up a Python FASTA subsampling program?

Posted 2024-09-30 01:34:46


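I wrote a small script that subsamples x records from an original file. The original file is FASTA with two lines per sequence, and the program extracts x sequences (each pair of lines kept together). This is what it looks like: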

#!/usr/bin/env python3
import random
import sys
# How many random sequences do you want?
num = int(input("Enter number of random sequences to select:\n"))

# Import arguments
infile = open(sys.argv[1], "r")
outfile = open(sys.argv[2], "w")

# Define lists
fNames = []
fSeqs = []
# Extract fasta file into the two lists
for line in infile:
    if line.startswith(">"):
        fNames.append(line.rstrip())
    else:
        fSeqs.append(line.rstrip())

# Print total number of sequences in the original file
print("There are "+str(len(fNames))+" in the input file")

# Take random items out of the list for the total number of samples required
for j in range(num):
    a = random.randint(0, (len(fNames)-1))
    print(fNames.pop(a), file = outfile)
    print(fSeqs.pop(a), file = outfile)

infile.close()
outfile.close()
input("Done.")

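Building the lists of IDs and nucleotides (lines 1 and 2 of each record, respectively) is very fast, but printing them out takes a very long time. The number of sequences to extract can reach 2 million, but it already gets slow from 10,000.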

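I'd like to know whether there is a way to make it faster. Is .pop the problem? Would it be faster if I first created a random list of unique numbers and then extracted those?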
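As a rough illustration of the second idea (drawing unique random indices first, then extracting those records), here is a minimal sketch. The helper name `sample_pairs` and the toy data are hypothetical and not part of the original script. The motivation is that `list.pop(a)` at an arbitrary position has to shift every later element, so popping repeatedly from a large list costs far more than indexing into it.

```python
import random

def sample_pairs(names, seqs, k, seed=None):
    """Return k (name, sequence) pairs chosen without replacement.

    Hypothetical helper for illustration; names and seqs are parallel lists.
    """
    rng = random.Random(seed)
    # Draw k distinct indices; sampling from a range avoids building an index list
    indices = rng.sample(range(len(names)), k)
    # Plain indexing is O(1) per item, unlike pop() from the middle of a list
    return [(names[i], seqs[i]) for i in indices]

# Toy data standing in for the parsed FASTA lists
names = [">seq1", ">seq2", ">seq3", ">seq4"]
seqs = ["ACGT", "GGCC", "TTAA", "CAGT"]

picked = sample_pairs(names, seqs, 2, seed=0)
for name, seq in picked:
    print(name)
    print(seq)
```

Because the indices are distinct, no record can be selected twice, which matches the behaviour of the pop-based loop.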

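Finally, after printing Done. the terminal does not return to its normal "finished" state, and I don't know why. With all my other scripts I can type right away once they finish.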


1 answer

Answer 1 · posted 2024-09-30 01:34:46

Using random.sample (as suggested in the comments) together with a dictionary made the script much faster. Here is the final script:

#!/usr/bin/env python3
import random
import sys
# How many random sequences do you want?
num = int(input("Enter number of random sequences to select:\n"))

# Import arguments
infile = open(sys.argv[1], "r")
outfile = open(sys.argv[2], "w")

# Define list and dictionary
fNames = []
dicfasta = {}
# Extract fasta file into the list of IDs and the ID -> sequence dictionary
for line in infile:
    if line.startswith(">"):
        fNames.append(line.rstrip())
        Id = line.rstrip()
    else:
        dicfasta[Id] = line.rstrip()

# Print total number of sequences in the original file
print("There are "+str(len(fNames))+" in the input file")

# Draw num distinct IDs at random (sampling without replacement)
subsample = random.sample(fNames, num)

# Write each sampled ID followed by its sequence
for j in subsample:
    print(j, file = outfile)
    print(dicfasta[j], file = outfile)

infile.close()
outfile.close()
# print() instead of input(): input() would wait for Enter before returning to the shell
print("Done.")
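For comparison, the same approach can be sketched as a single function. This is only one possible arrangement under stated assumptions, not the answerer's code: the function name `subsample_fasta` and its parameters are hypothetical, and it assumes exactly one sequence line per header, as described in the question. It keeps each record as a (header, sequence) pair, so records that happen to share an ID are not overwritten the way they would be in a dictionary keyed by ID, and it uses `with` blocks so both files are closed even if an error occurs.

```python
import random

def subsample_fasta(in_path, out_path, num, seed=None):
    """Write num randomly chosen two-line FASTA records to out_path.

    Hypothetical helper for illustration. Returns the number of records read.
    Raises ValueError (from random.sample) if num exceeds that number.
    """
    records = []
    with open(in_path) as infile:
        header = None
        for line in infile:
            line = line.rstrip()
            if line.startswith(">"):
                header = line
            elif header is not None and line:
                # Pair the sequence line with the header that preceded it
                records.append((header, line))
                header = None

    rng = random.Random(seed)
    # Sampling without replacement: no record is chosen twice
    chosen = rng.sample(records, num)

    with open(out_path, "w") as outfile:
        for header, seq in chosen:
            outfile.write(f"{header}\n{seq}\n")
    return len(records)
```

A call such as `subsample_fasta("in.fasta", "out.fasta", 1000)` (file names are placeholders) would return the total record count. The function ends without a trailing `input()` call; the `input("Done.")` in the question's script waits for the Enter key, which is why the terminal there does not return to the prompt right away.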
