<p><a href="https://samtools.github.io/hts-specs/SAMv1.pdf" rel="nofollow noreferrer">SAM spec</a>为我们提供了这个雪茄操作表,指明哪些操作“使用”查询或引用,以及如何从雪茄字符串计算序列长度的明确说明:</p>
<blockquote>
<pre><code> Consumes Consumes
Op BAM Description query reference
M 0 alignment match (can be a sequence match or mismatch) yes yes
I 1 insertion to the reference yes no
D 2 deletion from the reference no yes
N 3 skipped region from the reference no yes
S 4 soft clipping (clipped sequences present in SEQ) yes no
H 5 hard clipping (clipped sequences NOT present in SEQ) no no
P 6 padding (silent deletion from padded reference) no no
= 7 sequence match yes yes
X 8 sequence mismatch yes yes
</code></pre>
<ul>
<li>“Consumes query” and “consumes reference” indicate whether the CIGAR operation causes the alignment to step along the query sequence and the reference sequence respectively.</li>
</ul>
<p>...</p>
<ul>
<li>Sum of lengths of the M/I/S/=/X operations shall equal the length of SEQ.</li>
</ul>
</blockquote>
<p>这使得我们可以通过将雪茄中所有“consumers query”操作的长度相加来计算序列的长度。这正是在雪茄模块中发生的事情(参见<a href="https://github.com/brentp/cigar/blob/754cfed348364d390ec1aa40c951362ca1041f7a/cigar.py#L88-L93" rel="nofollow noreferrer">https://github.com/brentp/cigar/blob/754cfed348364d390ec1aa40c951362ca1041f7a/cigar.py#L88-L93</a>),所以我不知道为什么这里的OP认为模块的实现是错误的。在</p>
<p>如果我们从(已经非常短的)雪茄模块中提取出相关的代码,那么我们只剩下这样一个简短的Python实现,它是上面引述的求和操作的一个简短的Python实现:</p>
<pre class="lang-py prettyprint-override"><code>from itertools import groupby
def query_len(cigar_string):
"""
Given a CIGAR string, return the number of bases consumed from the
query sequence.
"""
read_consuming_ops = ("M", "I", "S", "=", "X")
result = 0
cig_iter = groupby(cigar_string, lambda chr: chr.isdigit())
for _, length_digits in cig_iter:
length = int(''.join(length_digits))
op = next(next(cig_iter)[1])
if op in read_consuming_ops:
result += length
return result
</code></pre>