<p>Split on the <code>|</code> pipe character, then skip everything up to the first <code>gb</code>; the next element is the ID:</p>
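<pre><code>from itertools import dropwhile

# text holds the '|'-separated header line
text = iter(text.split('|'))
# consume fields up to and including the first 'gb' marker
next(dropwhile(lambda s: s != 'gb', text))
# the field right after 'gb' is the ID
id = next(text)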
</code></pre>
<p>Demo:</p>
<pre><code>>>> text = '>gi|124486857|ref|NP_001074751.1| inhibitor of Bruton tyrosine kinase [Mus musculus] >gi|341941060|sp|Q6ZPR6.3|IBTK_MOUSE RecName: Full=Inhibitor of Bruton tyrosine kinase; Short=IBtk >gi|148694536|gb|EDL26483.1| mCG128548, isoform CRA_d [Mus musculus] >gi|223460980|gb|AAI37799.1| Ibtk protein [Mus musculus]'
>>> text = iter(text.split('|'))
>>> next(dropwhile(lambda s: s != 'gb', text))
'gb'
>>> id = next(text)
>>> id
'EDL26483.1'
</code></pre>
<p>In other words, no regular expression is needed.</p>
<p>Turn this into a generator function to get all the IDs:</p>
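<pre><code>from itertools import dropwhile

def extract_ids(text):
    text = iter(text.split('|'))
    while True:
        try:
            # skip ahead to the next 'gb' marker, then take the field after it
            next(dropwhile(lambda s: s != 'gb', text))
            found = next(text)
        except StopIteration:
            # input exhausted; a StopIteration must not leak out of a generator (PEP 479)
            return
        yield found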
</code></pre>
<p>This gives:</p>
<pre><code>>>> text = '>gi|124486857|ref|NP_001074751.1| inhibitor of Bruton tyrosine kinase [Mus musculus] >gi|341941060|sp|Q6ZPR6.3|IBTK_MOUSE RecName: Full=Inhibitor of Bruton tyrosine kinase; Short=IBtk >gi|148694536|gb|EDL26483.1| mCG128548, isoform CRA_d [Mus musculus] >gi|223460980|gb|AAI37799.1| Ibtk protein [Mus musculus]'
>>> list(extract_ids(text))
['EDL26483.1', 'AAI37799.1']
</code></pre>
<p>Or you can use it in a simple loop:</p>
<pre><code>for id in extract_ids(text):
    print(id)
</code></pre>
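<p>If you also need the IDs that follow other database tags in the same header (such as <code>ref</code> or <code>sp</code>), the same idea can be parameterized. The following is only a sketch: the function name <code>extract_tagged_ids</code> and its <code>tag</code> parameter are made up for illustration, and the sample header is a shortened version of the one above.</p>
<pre><code>from itertools import dropwhile

def extract_tagged_ids(text, tag='gb'):
    """Yield the field that follows each occurrence of tag (hypothetical helper)."""
    fields = iter(text.split('|'))
    while True:
        try:
            # consume fields up to and including the next tag marker
            next(dropwhile(lambda s: s != tag, fields))
            tagged_id = next(fields)
        except StopIteration:
            # no more markers, or nothing after the last one
            return
        yield tagged_id

header = ('>gi|124486857|ref|NP_001074751.1| inhibitor of Bruton tyrosine kinase [Mus musculus] '
          '>gi|148694536|gb|EDL26483.1| mCG128548, isoform CRA_d [Mus musculus]')
print(list(extract_tagged_ids(header, 'ref')))  # ['NP_001074751.1']
print(list(extract_tagged_ids(header)))         # ['EDL26483.1']
</code></pre>
<p>Note that this matches a tag only when it is a whole field between pipes, so a description that merely contains the letters <code>gb</code> is not picked up.</p>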