<p>This is a little verbose compared to a regex, but you can find the indices of the hyphens and group them using first differences:</p>
<pre><code>>>> import numpy as np
>>> def get_seq_gaps(seq):
...     # Indices of every hyphen in the sequence
...     gaps = np.array([i for i, el in enumerate(seq) if el == '-'])
...     if gaps.size == 0:
...         # No hyphens: report zero gaps and stop
...         yield 0
...         return
...     # Label consecutive runs: the label increments wherever the step is not 1
...     diff = np.cumsum(np.append([False], np.diff(gaps) != 1))
...     un = np.unique(diff)
...     yield len(un)
...     for i in un:
...         subseq = gaps[diff == i]
...         yield i + 1, len(subseq), subseq.min(), subseq.max()
...
>>> def report_gaps(seq):
...     gaps = get_seq_gaps(seq)
...     print('Number of gaps = %s\n' % next(gaps))
...     for (i, l, mn, mx) in gaps:
...         print('Index Position of Gap region %s = %s to %s' % (i, mn, mx))
...         print('Length of Gap Region %s = %s\n' % (i, l))
...
>>> seq = 'ATC----GCTGTA--A-----T'
>>> report_gaps(seq)
Number of gaps = 3

Index Position of Gap region 1 = 3 to 6
Length of Gap Region 1 = 4

Index Position of Gap region 2 = 13 to 14
Length of Gap Region 2 = 2

Index Position of Gap region 3 = 16 to 20
Length of Gap Region 3 = 5

</code></pre>
<p>First, the function builds an array of the indices at which hyphens occur:</p>
<pre><code>>>> gaps
array([ 3,  4,  5,  6, 13, 14, 16, 17, 18, 19, 20])
</code></pre>
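<p>Places where the first difference is not 1 mark a break between gap regions. A leading <code>False</code> is prepended to keep the length the same as <code>gaps</code>, and the cumulative sum then gives each region its own label:</p>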
<pre><code>>>> diff
array([0, 0, 0, 0, 1, 1, 2, 2, 2, 2, 2])
</code></pre>
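<p>To see where those labels come from, the intermediate arrays can be printed one at a time. This is only a sketch of the same step outside the function; the names <code>steps</code>, <code>breaks</code> and <code>labels</code> are made up for illustration and do not appear in the code above.</p>
<pre><code>import numpy as np

# Hyphen indices for the example sequence 'ATC----GCTGTA--A-----T'
gaps = np.array([3, 4, 5, 6, 13, 14, 16, 17, 18, 19, 20])

# Distance between neighbouring hyphen indices
steps = np.diff(gaps)

# True wherever the next hyphen is not adjacent, i.e. a new region starts
breaks = steps != 1

# Prepend False so the result lines up with gaps, then count breaks so far
labels = np.cumsum(np.append([False], breaks))

print(steps)   # [1 1 1 7 1 2 1 1 1 1]
print(breaks)
print(labels)  # [0 0 0 0 1 1 2 2 2 2 2]
</code></pre>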
<p>Now take the unique labels of these groups, restrict <code>gaps</code> to the indices belonging to each label, and find its min/max.</p>
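<p>As a rough check, the grouping step can be run on its own and compared with the regex approach mentioned at the start. The sketch below assumes a plain <code>-+</code> pattern for the regex side; <code>regions</code> and <code>regex_regions</code> are illustrative names, not part of the functions above.</p>
<pre><code>import re
import numpy as np

seq = 'ATC----GCTGTA--A-----T'
gaps = np.array([i for i, el in enumerate(seq) if el == '-'])
labels = np.cumsum(np.append([False], np.diff(gaps) != 1))

# One (start, end, length) tuple per label, taken from the matching part of gaps
regions = []
for label in np.unique(labels):
    subseq = gaps[labels == label]
    regions.append((int(subseq.min()), int(subseq.max()), len(subseq)))

print(regions)  # [(3, 6, 4), (13, 14, 2), (16, 20, 5)]

# Regex cross-check: each run of hyphens is a single match
regex_regions = [(m.start(), m.end() - 1, m.end() - m.start())
                 for m in re.finditer(r'-+', seq)]
print(regions == regex_regions)  # True
</code></pre>
<p>Both routes give the same start, end and length for every gap region on this input, so the choice between them is mostly about readability and whether NumPy is already in use.</p>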