如何压缩两个非常大的字符串并返回匹配和不匹配的索引?

2024-09-23 20:18:26 发布

您现在位置:Python中文网/ 问答频道 /正文

我有一组文本文件,其中包含两个长度相同的非常大的字符集。字符集是DNA序列,所以我将它们称为seq_1和{},它们一起被称为alignment。文件如下所示:

>HEADER1
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
>HEADER2
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN

>HEADER1下序列1中的可能字符是ACGTN-,而{}下序列2中的可能字符是ACGTN-*。在

我想分析序列并返回两个索引列表,我将分别调用valid和{}。在

valid包含两个序列(“对齐”)中的位置在集合中的所有(基于1的)索引ACGTmismatch包含对齐中的位置在集合ACGT中但彼此不匹配的所有(基于1的)索引。因此mismatch是{}的子集。在

最后一个条件是,在序列1是"-"的位置,我做递增索引计数器,因为这些在我使用的坐标系中被认为是“间隙”。在

此示例显示了我的预期输出的对齐方式:

^{pr2}$

我希望改进我当前的代码(如下所示),其中包括对序列的regex提取、压缩和枚举非间隙站点到生成器对象中,然后——主要耗时的步骤——在生成器中循环并填充两个列表。我觉得必须有一个基于数组或itertools的解决方案来解决这个问题,它比通过序列及其索引压缩在一起的for循环更有效,我正在寻找建议。在

代码:

^{3}$

编辑:根据流行的需求,这里有一个链接指向某个文件,供那些想测试代码的人使用: https://www.dropbox.com/s/i904fil7cvv1vco/chr1_homo_maca_100Multiz.fa?dl=0


Tags: 文件代码列表序列字符dna文本文件valid
2条回答

这是我的版本(但你可能开始向我扔石头混淆)请张贴一些较长的测试数据,我想测试它。。。在

seq1='CT-A-CG'
seq2='CNCTA*G'

import numpy as np
def is_not_gap(a): return 0 if (a=='-') else 1
def is_valid(a): return 1 if (a=='A' or a=='C' or a=='G' or a=='T' ) else 0
def is_mm(t): return 0 if t[0]==t[1] else 1

# find the indexes to be retained
retainx=np.where( np.multiply( map(is_not_gap, seq1), map(is_not_gap, seq2) ) )[0].tolist()

# find the valid ones
valid0=np.where(np.multiply( map( is_valid, seq1),map( is_valid, seq2))[retainx])[0].tolist()

# find the mismatches
mm=np.array(map( is_mm, zip( seq1,seq2)))
mismatch0=set(valid0) & set(np.where(mm[retainx])[0])

结果(从零开始的索引):

^{pr2}$

(如果你愿意,我可以发布更长、更详细的版本)

读了你的代码,我可以看出你是一个聪明的家伙,所以我要给你一些完全未经测试的东西,让你弄清楚如何让它工作,以及是否有任何一个比你现有的更快:-)

(嘿,不管怎样,你的问题并不是给了一个真实的数据集…)

编辑使用十六进制数字来计算不匹配。在

#! /usr/bin/env python2.7

# Text is much faster in 2.7 than 3...

def rm_gaps(seq1, seq2):
    ''' Given a first sequence with gaps removed,
        do the same operation to a second sequence.
    '''
    startpos = 0
    for substring in seq1:
        length = len(substring)
        yield seq2[startpos:length]
        startpos += length + 1

def seq_divergence(infile):

    # Build a character translation map with ones
    # in the positions of valid bases, and
    # another one with hex numbers for each base.

    transmap_v = ['0'] * 256
    transmap_m = ['0'] * 256
    for ch, num in zip('ACGT', '1248'):
        transmap_v[ord(ch)] = '1'
        transmap_m[ord(ch)] = num
    transmap_v = ''.join(transmap_v)
    transmap_m = ''.join(transmap_m)


    # No need to do everything inside open   you are
    # done with the file once you have read it in.
    with open(infile, 'rb') as f:
        alignment = f.read()

    # For this case, using regular string stuff might be faster than re

    h1 = '>HEADER1\n'
    h2 = h1.replace('1', '2')

    h1loc = alignment.find(h1)
    h2loc = alignment.find(h2)

    # This assumes header 1 comes before header 2.  If that is
    # not invariant, you will need more code here.

    seq1 = alignment[h1loc + len(h1):h2loc].replace('\n','')
    seq2 = alignment[h2loc + len(h2):].replace('\n','')

    # Remove the gaps
    seq1 = seq1.split('-')
    seq2 = rm_gaps(seq1, seq2)
    seq1 = ''.join(seq1)
    seq2 = ''.join(seq2)

    assert len(seq1) == len(seq2)

    # Let's use Python's long integer capability to
    # find locations where both sequences are valid.
    # Convert each sequence into a binary number,
    # and AND them together.
    num1 = int(seq1.translate(transmap_v), 2)
    num2 = int(seq2.translate(transmap_v), 2)
    valid = ('{0:0%db}' % len(seq1)).format(num1 & num2)
    assert len(valid) == len(seq1)


    # Now for the mismatch   use hexadecimal instead
    # of binary here.  The 4 bits per character position
    # nicely matches our 4 possible bases.
    num1 = int(seq1.translate(transmap_m), 16)
    num2 = int(seq2.translate(transmap_m), 16)
    mismatch = ('{0:0%dx}' % len(seq1)).format(num1 & num2)
    assert len(match) == len(seq1)

    # This could possibly use some work.  For example, if
    # you expect very few invalid and/or mismatches, you
    # could split on '0' in both these cases, and then
    # use the length of the strings between the zeros
    # and concatenate ranges for valid, or use them as
    # skip distances for the mismatches.

    valid = [x for x, y in enumerate(valid,1) if y == '1']
    mismatch = [x for x, y in enumerate(mismatch, 1) if y == '0']

    return valid, mismatch

相关问题 更多 >