Parsing IDs from text with Python

>gi|124486857|ref|NP_001074751.1| inhibitor of Bruton tyrosine kinase [Mus musculus] >gi|341941060|sp|Q6ZPR6.3|IBTK_MOUSE RecName: Full=Inhibitor of Bruton tyrosine kinase; Short=IBtk >gi|148694536|gb|EDL26483.1| mCG128548, isoform CRA_d [Mus musculus] >gi|223460980|gb|AAI37799.1| Ibtk protein [Mus musculus]

3 answers

User

#1 · edited 2024-07-05 14:58:06

A regular expression should work:

import re
# Raw string so that \| reaches the regex engine as an escaped pipe
re.findall(r'gb\|([^|]*)\|', 'gb|AB1234|')
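
Applied to the header line from the question, the same pattern should pull out every gb ID in one call. This is only a minimal sketch; the names `text`, `GB_ID` and `gb_ids` are illustrative and not part of the original answer.

```python
import re

# Header line copied from the question
text = ('>gi|124486857|ref|NP_001074751.1| inhibitor of Bruton tyrosine kinase [Mus musculus] '
        '>gi|341941060|sp|Q6ZPR6.3|IBTK_MOUSE RecName: Full=Inhibitor of Bruton tyrosine kinase; Short=IBtk '
        '>gi|148694536|gb|EDL26483.1| mCG128548, isoform CRA_d [Mus musculus] '
        '>gi|223460980|gb|AAI37799.1| Ibtk protein [Mus musculus]')

# Capture the field that follows a literal "gb|" and runs up to the next "|"
GB_ID = re.compile(r'gb\|([^|]*)\|')

gb_ids = GB_ID.findall(text)
print(gb_ids)  # ['EDL26483.1', 'AAI37799.1']
```

Note that the pattern does not require `gb` to be a whole field, so a field that merely ends in "gb" would match as well; anchoring it as `\|gb\|` is one way to tighten that.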

User

#2 · edited 2024-07-05 14:58:06

Split on the | pipe character, then skip everything up to the first gb; the next element is the ID:

from itertools import dropwhile

text = iter(text.split('|'))
next(dropwhile(lambda s: s != 'gb', text))
id = next(text)

Demo:

>>> text = '>gi|124486857|ref|NP_001074751.1| inhibitor of Bruton tyrosine kinase [Mus musculus] >gi|341941060|sp|Q6ZPR6.3|IBTK_MOUSE RecName: Full=Inhibitor of Bruton tyrosine kinase; Short=IBtk >gi|148694536|gb|EDL26483.1| mCG128548, isoform CRA_d [Mus musculus] >gi|223460980|gb|AAI37799.1| Ibtk protein [Mus musculus]'
>>> text = iter(text.split('|'))
>>> next(dropwhile(lambda s: s != 'gb', text))
'gb'
>>> id = next(text)
>>> id
'EDL26483.1'

In other words, no regular expression is needed.

Turn it into a generator function to get all the IDs:

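from itertools import dropwhile

def extract_ids(text):
    text = iter(text.split('|'))
    while True:
        try:
            # Skip ahead to the next 'gb' field; the field after it is the ID
            next(dropwhile(lambda s: s != 'gb', text))
            yield next(text)
        except StopIteration:
            # Input exhausted. Since Python 3.7 (PEP 479) a StopIteration that
            # escapes a generator becomes a RuntimeError, so return explicitly.
            return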

This gives:

>>> text = '>gi|124486857|ref|NP_001074751.1| inhibitor of Bruton tyrosine kinase [Mus musculus] >gi|341941060|sp|Q6ZPR6.3|IBTK_MOUSE RecName: Full=Inhibitor of Bruton tyrosine kinase; Short=IBtk >gi|148694536|gb|EDL26483.1| mCG128548, isoform CRA_d [Mus musculus] >gi|223460980|gb|AAI37799.1| Ibtk protein [Mus musculus]'
>>> list(extract_ids(text))
['EDL26483.1', 'AAI37799.1']

Or you can use it in a simple loop:

for id in extract_ids(text):
    print(id)

User

#3 · edited 2024-07-05 14:58:06

In this case you can do it without a regexp: just split on '|gb|', then split the second part on '|' and take the first item:

s = 'the string from the question'
r = s.split('|gb|')
# r is a list; r[1] is the text after the first '|gb|'
r[1].split('|')[0]

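Of course, you would have to add a check in case the list returned by the first split contains more or fewer than 2 items, but I think this would be faster than using a regexp.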
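
One way to handle that check is to stop assuming exactly two parts and take the leading field of every part after the first. This is a rough sketch only; `ids_after_gb` is a hypothetical helper name, not something from the answer above.

```python
def ids_after_gb(s):
    """Return every field that directly follows a '|gb|' marker (hypothetical helper)."""
    parts = s.split('|gb|')
    # parts[0] is whatever precedes the first marker; each later part
    # starts with an ID that ends at the next '|'.
    return [part.split('|')[0] for part in parts[1:]]


# Header line copied from the question
text = ('>gi|124486857|ref|NP_001074751.1| inhibitor of Bruton tyrosine kinase [Mus musculus] '
        '>gi|341941060|sp|Q6ZPR6.3|IBTK_MOUSE RecName: Full=Inhibitor of Bruton tyrosine kinase; Short=IBtk '
        '>gi|148694536|gb|EDL26483.1| mCG128548, isoform CRA_d [Mus musculus] '
        '>gi|223460980|gb|AAI37799.1| Ibtk protein [Mus musculus]')

print(ids_after_gb(text))            # ['EDL26483.1', 'AAI37799.1']
print(ids_after_gb('no marker'))     # []
```

With no '|gb|' in the input the split yields a single part, so the result is an empty list rather than an IndexError.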
