使用Python来减少表格形式的BLAST输出文件中的命中数

2024-09-30 16:42:31 发布

您现在位置:Python中文网/ 问答频道 /正文

我有一个表格格式的大blast文件,目标序列的数量不受限制,所以解析需要很长时间。我希望将每个查询序列的命中数减少到前10个。 我的python是基本的,但是到目前为止我所拥有的

import sys

blastfile = open(sys.argv[1],"r")

column1list=[]

for line in blastfile:
    b = line.split()[0]
    column1list.append(b)

uniqcolumn1 = list(set(column1list))

counter = 0

for val in uniqcolumn1:
    #print val
    for line in blastfile:
        #print line
        while counter <= 10:
            if line.startswith(val):
                print line
                counter =+ 1

下面是blast输出文件的一行示例,查询序列的标题在第一列中,在本例中为“c8208”g1_i2

^{pr2}$

我认为代码的第一部分工作正常,直到' uniqcolumn1=list(set(column1list))',那么我无法让它打印以列表中每个字符串开头的前十行。在


Tags: 文件inforsyslinecounter序列val
2条回答

此一次性版本按标题在文件中出现的顺序打印每个标题的前10个:

import sys
NUM_TO_PRINT=10     # good practice - use names rather than raw numbers

blastfile = open(sys.argv[1],"r")

titles={};  # an empty dictionary.
    # This will map titles to counts of how many times a line with that title
    # has been printed.

for line in blastfile:
    title = line.split()[0];    # assuming the title is space-delimited, and that the line is not empty
    num_printed = titles.get(title, 0);     # 0 is the default
    if num_printed<NUM_TO_PRINT:
        print line,    # comma because _line_ already has a newline - without the comma, you get a blank line after every printed line
        num_printed += 1
        titles[title] = num_printed     # save where we are 

这里的问题似乎是您在文件对象中迭代了两次。在Python中,文件对象的工作方式非常类似于遍历每一行的指针。如果不向后移动指针,它就没有可读取的内容。在

您需要做的是使用.seek函数将该指针移回起始位置。例如,假设你有一个file_to_read.txt和你的python_script.py。在

文件_阅读.txt

Hello! My name is Bob and I can't think of anything to
put in this file so I'm blabbering on about nonsense
in hopes that you won't realise that this text is not
important but the code in the actually file, though I
think that you wouldn't mind reading this long file.

Python_脚本.py

^{pr2}$

如果要运行这段代码(并且没有出现有关目录的错误),则只需打印一次file_to_read.txt。要解决这个问题,您只需在阅读之间添加一个f.seek(0, 0)。例如:

f = open("file_to_read.txt", "r")
for line in f: print line
f.seek(0, 0)
for lien in f: print line

现在,回到您的上下文,您可以看到这是如何应用于您的代码的:

import sys
# Here is your reading of file
blastfile = open(sys.argv[1],"r")
column1list = []
# Here is the first time you read the file
for line in blastfile:
    b = line.split()[0]
    column1list.append(b)
# Add a line to move back to the start before the
# next reading
blastfile.seek(0, 0)

uniqcolumn1 = list(set(column1list))

for val in uniqcolumn1:
    # Move the counter inside to refresh it after every iteration
    counter = 0
    # Here is the second time you read your file
    for line in blastfile:
        while counter <= 10:
            if line.startswith(val):
                print line
                counter += 1
    # Since you are going to read the file the next iteration,
    # .seek the file
    blastfile.seek(0, 0)

编辑

这是代码的后半部分,已修复。你也可以这样做:

for val in uniqcolumn1:
    # Move the counter in
    counter = 0
    # Move the while loop out
    while counter <= 10:
        for line in blastfile:
            if line.startswith(val):
                print line,
                counter += 1
    blastfile.seek(0, 0)

这样做的好处是for循环提前终止,它不会读取整个文件。在

另一种方法是:

for val in uniqcolumn1:
    # Move counter in
    counter = 0
    # Remove while statement
    for line in blastfile:
        # Add additional condition to if statement
        if line.startswith(val) and counter <= 10:
            print line,
            counter += 1
        elif counter > 10:
            break
    blastfile.seek(0, 0)

这样做的好处是看起来更简单。在

相关问题 更多 >