如何压缩两个非常大的字符串并返回匹配和不匹配的索引？问题的回答

如何压缩两个非常大的字符串并返回匹配和不匹配的索引？

回答此问题可获得 20 贡献值，回答如果被采纳可获得 50 分。

我有一组文本文件，其中包含两个长度相同的非常大的字符集。字符集是DNA序列，所以我将它们称为<code>seq_1</code>和{<cd2>}，它们一起被称为<code>alignment</code>。文件如下所示： <pre><code>>HEADER1 NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN >HEADER2 NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN </code></pre> <code>>HEADER1</code>下序列1中的可能字符是<code>ACGTN-</code>，而{<cd6>}下序列2中的可能字符是<code>ACGTN-*</code>。在 我想分析序列并返回两个索引列表，我将分别调用<code>valid</code>和{<cd9>}。在 <code>valid</code>包含两个序列（“对齐”）中的位置在集合中的所有（基于1的）索引<code>ACGT</code>；<code>mismatch</code>包含对齐中的位置在集合<code>ACGT</code>中但彼此不匹配的所有（基于1的）索引。因此<code>mismatch</code>是{<cd8>}的子集。在 最后一个条件是，在序列1是<code>"-"</code>的位置，我做不递增索引计数器，因为这些在我使用的坐标系中被认为是“间隙”。在 此示例显示了我的预期输出的对齐方式： ^{pr2}$ 我希望改进我当前的代码（如下所示），其中包括对序列的regex提取、压缩和枚举非间隙站点到生成器对象中，然后——主要耗时的步骤——在生成器中循环并填充两个列表。我觉得必须有一个基于数组或<code>itertools</code>的解决方案来解决这个问题，它比通过序列及其索引压缩在一起的for循环更有效，我正在寻找建议。在 代码： ^{3}$ 编辑：根据流行的需求，这里有一个链接指向某个文件，供那些想测试代码的人使用： <a href="https://www.dropbox.com/s/i904fil7cvv1vco/chr1_homo_maca_100Multiz.fa?dl=0" rel="nofollow">https://www.dropbox.com/s/i904fil7cvv1vco/chr1_homo_maca_100Multiz.fa?dl=0</a>

0 条评论
分类：Python问答

默认排序时间排序

1 个回答

匿名 1天前

　擅长：python、mysql、java

读了你的代码，我可以看出你是一个聪明的家伙，所以我要给你一些完全未经测试的东西，让你弄清楚如何让它工作，以及是否有任何一个比你现有的更快：-） （嘿，不管怎样，你的问题并不是给了一个真实的数据集…） 编辑使用十六进制数字来计算不匹配。在 <pre><code>#! /usr/bin/env python2.7 # Text is much faster in 2.7 than 3... def rm_gaps(seq1, seq2): ''' Given a first sequence with gaps removed, do the same operation to a second sequence. ''' startpos = 0 for substring in seq1: length = len(substring) yield seq2[startpos:length] startpos += length + 1 def seq_divergence(infile): # Build a character translation map with ones # in the positions of valid bases, and # another one with hex numbers for each base. transmap_v = ['0'] * 256 transmap_m = ['0'] * 256 for ch, num in zip('ACGT', '1248'): transmap_v[ord(ch)] = '1' transmap_m[ord(ch)] = num transmap_v = ''.join(transmap_v) transmap_m = ''.join(transmap_m) # No need to do everything inside open you are # done with the file once you have read it in. with open(infile, 'rb') as f: alignment = f.read() # For this case, using regular string stuff might be faster than re h1 = '>HEADER1\n' h2 = h1.replace('1', '2') h1loc = alignment.find(h1) h2loc = alignment.find(h2) # This assumes header 1 comes before header 2. If that is # not invariant, you will need more code here. seq1 = alignment[h1loc + len(h1):h2loc].replace('\n','') seq2 = alignment[h2loc + len(h2):].replace('\n','') # Remove the gaps seq1 = seq1.split('-') seq2 = rm_gaps(seq1, seq2) seq1 = ''.join(seq1) seq2 = ''.join(seq2) assert len(seq1) == len(seq2) # Let's use Python's long integer capability to # find locations where both sequences are valid. # Convert each sequence into a binary number, # and AND them together. num1 = int(seq1.translate(transmap_v), 2) num2 = int(seq2.translate(transmap_v), 2) valid = ('{0:0%db}' % len(seq1)).format(num1 & num2) assert len(valid) == len(seq1) # Now for the mismatch use hexadecimal instead # of binary here. The 4 bits per character position # nicely matches our 4 possible bases. num1 = int(seq1.translate(transmap_m), 16) num2 = int(seq2.translate(transmap_m), 16) mismatch = ('{0:0%dx}' % len(seq1)).format(num1 & num2) assert len(match) == len(seq1) # This could possibly use some work. For example, if # you expect very few invalid and/or mismatches, you # could split on '0' in both these cases, and then # use the length of the strings between the zeros # and concatenate ranges for valid, or use them as # skip distances for the mismatches. valid = [x for x, y in enumerate(valid,1) if y == '1'] mismatch = [x for x, y in enumerate(mismatch, 1) if y == '0'] return valid, mismatch </code></pre>

如何压缩两个非常大的字符串并返回匹配和不匹配的索引？

1 个回答

相关Python问题