Sequence matching with Python

Posted 2024-09-27 21:33:55


I am working on RNA sequence matching:

seq = 'UCAGCUGUCAGUCAUGAUC'
sub_seq =['UGUCAG', 'CAGUCA', 'UCAGCU','GAUC']

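I match the subsequences against the sequence and print each matched subsequence underneath the sequence, at the position where it occurs, with dashes filling every unmatched position. The output should look like this: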

UCAGCUGUCAGUCAUGAUC
UCAGCU--CAGUCA-GAUC
-----UGUCAG--------

I tried to do this with a dictionary:

index_dict = {}
for sub in sub_seq:
    start = seq.find(sub)  # index of the first occurrence, or -1 if absent
    index_dict[start] = {
        'sequence': sub,
        'end_index': start + len(sub) - 1,  # inclusive end index
    }

I can't work out an algorithm for the alignment. Any help would be greatly appreciated!


2 Answers
seq_l = len(seq)
for ele in sub_seq:
    start = seq.find(ele)
    ln = len(ele)
    if start != -1:
        end = start + ln
        print("-" * start + ele + "-"*(seq_l- end))
    else:
        print("-" * seq_l)

-----UGUCAG--------
--------CAGUCA-----
UCAGCU-------------
---------------GAUC

I'm not sure where UCAGCU--CAGUCA-GAUC comes from, since your code only handles one subsequence at a time.
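
If the rows need to be reused rather than just printed, the same one-row-per-subsequence idea can be wrapped in a function. This is only a minimal sketch: the name rows_per_subsequence is hypothetical and not part of the answer above, and it assumes that only the first occurrence of each subsequence matters, since that is what str.find reports.

```python
def rows_per_subsequence(seq, sub_seqs):
    """Return one dash-padded row per subsequence (hypothetical helper)."""
    rows = []
    for sub in sub_seqs:
        start = seq.find(sub)  # first occurrence only; -1 when absent
        if start == -1:
            rows.append("-" * len(seq))  # nothing matched: a full row of dashes
        else:
            row = "-" * start + sub
            rows.append(row.ljust(len(seq), "-"))  # pad the tail with dashes
    return rows


seq = 'UCAGCUGUCAGUCAUGAUC'
sub_seq = ['UGUCAG', 'CAGUCA', 'UCAGCU', 'GAUC']
print("\n".join(rows_per_subsequence(seq, sub_seq)))
```

Each row has the same length as seq, so the rows line up under the sequence when printed one per line.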

If you'll allow me to change your index_dict slightly, consider:

seq = 'UCAGCUGUCAGUCAUGAUC'
sub_seq =['UGUCAG', 'CAGUCA', 'UCAGCU','GAUC']

index_dict = {}
for sub in sub_seq:
    start = seq.find(sub)  # index of the first occurrence
    index_dict[start] = {
        'sequence': sub,
        'end_index': start + len(sub),  # Note this changed: the end is now exclusive
    }
sorted_keys = sorted(index_dict)

# Greedily pack non-overlapping subsequences into lines
lines = []
while True:
    if not sorted_keys:
        break
    line = []
    next_index = 0
    for k in sorted_keys:
        if k >= next_index:
            line.append(k)
            next_index = index_dict[k]['end_index']
    # Remove keys we used, append line to lines
    for k in line:
        sorted_keys.remove(k)
    lines.append(line)

# Build output lines
olines = []
for line in lines:
    oline = ''
    for k in line:
        oline += '-' * (k - len(oline))     # Add dashes before subseq
        oline += index_dict[k]['sequence']  # Add subsequence
    oline += '-' * (len(seq) - len(oline))  # Add trailing dashes
    olines.append(oline)

print(seq)
print('\n'.join(olines))

Output:

UCAGCUGUCAGUCAUGAUC
UCAGCU--CAGUCA-GAUC
-----UGUCAG--------

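Note that this is fairly verbose and could be condensed a bit. The "while True" and "for line in lines" loops could be merged into a single loop, but keeping them separate should help explain one possible approach.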

Edit: here is one way to join the last two loops:

seq = 'UCAGCUGUCAGUCAUGAUC'
sub_seq =['UGUCAG', 'CAGUCA', 'UCAGCU','GAUC']

index_dict = {}
for sub in sub_seq:
    start = seq.find(sub)  # index of the first occurrence
    index_dict[start] = {
        'sequence': sub,
        'end_index': start + len(sub),  # Note this changed: the end is now exclusive
    }
sorted_keys = sorted(index_dict)

lines = []
while True:
    if not sorted_keys:
        break
    line = ''
    next_index = 0
    keys_used = []
    for k in sorted_keys:
        if k >= next_index:
            line += '-' * (k - len(line))            # Add dashes before subseq
            line += index_dict[k]['sequence']        # Add subsequence
            next_index = index_dict[k]['end_index']  # Update next_index
            keys_used.append(k)                      # Mark key as used
    for k in keys_used:
        sorted_keys.remove(k)                        # Remove used keys
    line += '-' * (len(seq) - len(line))             # Add trailing dashes
    lines.append(line)                               # Add line to lines

print(seq)
print('\n'.join(lines))

Output:

UCAGCUGUCAGUCAUGAUC
UCAGCU--CAGUCA-GAUC
-----UGUCAG--------
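
One caveat with keying index_dict by start position: two subsequences that begin at the same index overwrite each other, and a subsequence that is not found gets the key -1, which never satisfies k >= next_index and so is never removed from sorted_keys. The sketch below is one possible way around both, using a sorted list of (start, end, subsequence) tuples instead of a dictionary. The function name align_subsequences is hypothetical, and the sketch still assumes that only the first occurrence of each subsequence is of interest.

```python
def align_subsequences(seq, sub_seqs):
    """Greedily pack matched subsequences into dash-padded rows.

    Hypothetical helper, not from the original answer.
    """
    # Collect (start, end, sub) spans with an exclusive end;
    # subsequences that do not occur in seq are skipped.
    spans = []
    for sub in sub_seqs:
        start = seq.find(sub)  # first occurrence only
        if start != -1:
            spans.append((start, start + len(sub), sub))
    spans.sort()

    rows = []
    while spans:
        row = ''
        leftover = []
        for start, end, sub in spans:
            if start >= len(row):
                # Fits after what is already on this row
                row += '-' * (start - len(row)) + sub
            else:
                # Overlaps the current row; try again on a later row
                leftover.append((start, end, sub))
        rows.append(row.ljust(len(seq), '-'))  # add trailing dashes
        spans = leftover
    return rows


seq = 'UCAGCUGUCAGUCAUGAUC'
sub_seq = ['UGUCAG', 'CAGUCA', 'UCAGCU', 'GAUC']
print(seq)
print('\n'.join(align_subsequences(seq, sub_seq)))
```

The first span of every pass always fits on an empty row, so the list shrinks on each iteration and the loop terminates.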
