How to use Python regular expressions to separate each BLAST result and store it in a list for further analysis

Published 2024-09-29 17:51:05

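I am working on a set of biological sequences using ncbi-blast. I need some help processing the output file with Python regex. The text result, which contains multiple outputs (sequence analysis results), looks like this: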

Query= lcl|TRINITY_DN2888_c0_g2_i1

Length=1394 Score E Sequences producing significant alignments:
(Bits) Value

sp|Q9S775|PKL_ARATH

CHD3-type chromatin-remodeling factor PICKLE... 1640 0.0

sp|Q9S775|PKL_ARATH CHD3-type chromatin-remodeling factor PICKLE OS=Arabidopsis thaliana OX=3702 GN=PKL PE=1 SV=1 Length=1384

Score = 1640 bits (4248), Expect = 0.0, Method: Compositional matrix adjust. Identities = 830/1348 (62%), Positives = 1036/1348 (77%), Gaps = 53/1348 (4%)

Query 1
MSSLVERLRVRSERRPLYTDDDSDDDLYAARGGSESKQEERPPERIVRDDAKNDTCKTCG 60 MSSLVERLR+RS+R+P+Y DDSDDD + + +Q E IVR DAK + C+ CG Sbjct 1
MSSLVERLRIRSDRKPVYNLDDSDDDDFVPKKDRTFEQ----VEAIVRTDAKENACQACG 56

Lambda K H a alpha 0.317 0.134 0.389 0.792 4.96

Gapped Lambda K H a alpha sigma 0.267 0.0410 0.140 1.90 42.6 43.6

Effective search space used: 160862965056

Query= lcl|TRINITY_DN2855_c0_g1_i1

Length=145 ........................................ ................................................... ...................................................

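I want to extract the information starting from "Query= lcl|TRINITY_DN2888_c0_g2_i1" up to the next query, "Query= lcl|TRINITY_DN2855_c0_g1_i1", and store it in a Python list for further analysis (the whole file contains several thousand query results). Is there Python regex code that can do this?

Here is my code: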

#!/usr/bin/python3
file=open("path/file_name","r+")
import re
inter=file.read()
lst=[]
lst=re.findall(r'>(.*)>',inter,re.DOTALL)
print(lst)
for x in lst:
    print(x)

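I get the wrong output: the code prints all the information in the file (thousands of results) instead of extracting one result at a time.

Thank you.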


2 answers

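I finally found a solution for splitting the large file into small chunks, so that I can process each individual query result with Python regular expressions. Here is my code: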

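#!/usr/bin/python3
import re

# Read the whole BLAST report into memory; the file is closed automatically.
with open("/path/file_name.txt") as file:
    inter = file.read()

# Non-greedy match: one entry per query, spanning the text between
# "Query= lcl" and the next "Effective search space".
lst = re.findall(r'(?<=Query= lcl)(.*?)(?=Effective search space)', inter, flags=re.S)
print(lst)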
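As a variation on the same non-greedy idea, here is a minimal sketch that keeps the "Query= " line inside each chunk and then pulls a few fields out of it. The `SAMPLE` text is a shortened, hypothetical stand-in for a real report (the numbers for the second query are made up), and `split_reports` and `summarize` are illustrative names, not part of any library.

```python
import re

# Hypothetical, shortened stand-in for a multi-query BLAST text report.
SAMPLE = """Query= lcl|TRINITY_DN2888_c0_g2_i1

Length=1394
 Score = 1640 bits (4248), Expect = 0.0, Method: Compositional matrix adjust.

Effective search space used: 160862965056

Query= lcl|TRINITY_DN2855_c0_g1_i1

Length=145
 Score = 52.0 bits (123), Expect = 2e-08, Method: Compositional matrix adjust.

Effective search space used: 12345
"""

# Each match starts at a line beginning with "Query= " and stops just before
# the next such line (or at the end of the text). re.M lets ^ match at line
# starts; re.S lets . cross newlines.
CHUNK_RE = re.compile(r'^Query= .*?(?=^Query= |\Z)', flags=re.S | re.M)


def split_reports(text):
    """Return one string per query, each starting at its 'Query= ' line."""
    return CHUNK_RE.findall(text)


def summarize(chunk):
    """Return (query_id, first_score, first_e_value) for one chunk.

    Score and E-value are None when the chunk reports no alignment.
    """
    query_id = re.match(r'Query= (\S+)', chunk).group(1)
    hit = re.search(r'Score = ([\d.]+) bits.*?Expect = ([^,\s]+)', chunk, flags=re.S)
    if hit is None:
        return query_id, None, None
    return query_id, float(hit.group(1)), hit.group(2)


chunks = split_reports(SAMPLE)
for chunk in chunks:
    print(summarize(chunk))
```

Because every chunk begins with its own "Query= " line, the query ID can be recovered from the chunk itself, which is convenient when the list is processed later on.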

Thank you all for helping me.

To get the desired result, edit the line that calls re.findall() so that it uses re.split() instead:

lst=re.split(r'\n(?=Query= )',inter)

For details on re.split(), see this section:

https://docs.python.org/2/library/re.html
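One detail of re.split() is easy to trip over: its third positional parameter is maxsplit, not flags, so flags have to be passed by keyword. The sketch below shows this on a tiny made-up string (`text` is hypothetical, not real BLAST output), splitting at each newline that is followed by "Query= " so the marker stays at the start of every piece.

```python
import re

# Tiny made-up input: a header followed by two query blocks.
text = "header\nQuery= A\nbody A\nQuery= B\nbody B\n"

# The lookahead (?=Query= ) is not consumed, so each piece after the first
# still begins with "Query= ".
parts = re.split(r'\n(?=Query= )', text)

# Flags must be given by keyword; here the pattern is matched case-insensitively.
parts_ci = re.split(r'\n(?=query= )', text, flags=re.IGNORECASE)

# maxsplit limits the number of splits; also passed by keyword for clarity.
first_only = re.split(r'\n(?=Query= )', text, maxsplit=1)

print(parts)
```

The first element holds whatever precedes the first query (a report header, if any), so it may need to be dropped before further analysis.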

Also, you may want to consider using the now-deprecated BLAST parser in Biopython:

http://biopython.org/DIST/docs/tutorial/Tutorial.html#htoc96

The plain text BLAST parser is located in Bio.Blast.NCBIStandalone.

As with the XML parser, we need to have a handle object that we can pass to the parser. The handle must implement the readline() method and do this properly. The common ways to get such a handle are to either use the provided blastall or blastpgp functions to run the local blast, or to run a local blast via the command line, and then do something like the following:

result_handle = open("my_file_of_blast_output.txt")

Now that we've got a handle (which we'll call result_handle), we are ready to parse it. This can be done with the following code:

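>>> from Bio.Blast import NCBIStandalone
>>> blast_parser = NCBIStandalone.BlastParser()
>>> blast_record = blast_parser.parse(result_handle)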

This will parse the BLAST report into a Blast Record class (either a Blast or a PSIBlast record, depending on what you are parsing) so that you can extract the information from it. In our case, let’s just print out a quick summary of all of the alignments greater than some threshold value.

>>> E_VALUE_THRESH = 0.04
>>> for alignment in blast_record.alignments: 
...     for hsp in alignment.hsps: 
...         if hsp.expect < E_VALUE_THRESH: 
...             print('****Alignment****') 
...             print('sequence:', alignment.title) 
...             print('length:', alignment.length)
...             print('e value:', hsp.expect) 
...             print(hsp.query[0:75] + '...') 
...             print(hsp.match[0:75] + '...') 
...             print(hsp.sbjct[0:75] + '...')

If you also read the section 7.3 on parsing BLAST XML output, you’ll notice that the above code is identical to what is found in that section. Once you parse something into a record class you can deal with it independent of the format of the original BLAST info you were parsing. Pretty snazzy!
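If Biopython's plain-text parser is not available, a similar E-value filter can be approximated directly on the text of one query chunk. This is only a rough sketch under assumptions: `significant_hsps` is a hypothetical helper, the `chunk` lines are made-up examples in the style of the report above, and the regex only covers alignment headers of the form "Score = ... bits (...), Expect = ...".

```python
import re

E_VALUE_THRESH = 0.04

# Made-up alignment header lines in the style of the plain-text report.
chunk = (
    "Score = 1640 bits (4248), Expect = 0.0, Method: Compositional matrix adjust.\n"
    "Score = 30.1 bits (66), Expect = 1.2, Method: Compositional matrix adjust.\n"
    "Score = 45.0 bits (105), Expect = 3e-05, Method: Compositional matrix adjust.\n"
)

HSP_RE = re.compile(r'Score = ([\d.]+) bits \(\d+\), Expect = ([^,\s]+)')


def to_float(e_value):
    """Convert an E-value string to float.

    Assumption: some reports abbreviate values such as 1e-100 to 'e-100',
    so a leading '1' is added in that case.
    """
    if e_value.startswith('e'):
        e_value = '1' + e_value
    return float(e_value)


def significant_hsps(text, threshold=E_VALUE_THRESH):
    """Return (score, e_value) pairs whose E-value is below the threshold."""
    results = []
    for score, e_value in HSP_RE.findall(text):
        e = to_float(e_value)
        if e < threshold:
            results.append((float(score), e))
    return results


print(significant_hsps(chunk))
```

Unlike a real parser, this ignores alignment titles and sequences, so it is best treated as a quick filter rather than a replacement for structured parsing.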
