如何找到两个序列之间的重叠，并返回

3条回答

网友

1楼 · 编辑于 2024-10-04 05:22:12

请查看^{}库，并更准确地查看^{}：

import difflib

def get_overlap(s1, s2):
    s = difflib.SequenceMatcher(None, s1, s2)
    pos_a, pos_b, size = s.find_longest_match(0, len(s1), 0, len(s2)) 
    return s1[pos_a:pos_a+size]

s1 = "CGATTCCAGGCTCCCCACGGGGTACCCATAACTTGACAGTAGATCTC"
s2 = "GGCTCCCCACGGGGTACCCATAACTTGACAGTAGATCTCGTCCAGACCCCTAGC"

print(get_overlap(s1, s2)) # GGCTCCCCACGGGGTACCCATAACTTGACAGTAGATCTC

网友

2楼 · 编辑于 2024-10-04 05:22:12

您可以使用^{}：

d = difflib.SequenceMatcher(None,s1,s2)
>>> match = max(d.get_matching_blocks(),key=lambda x:x[2])
>>> match
Match(a=8, b=0, size=39)
>>> i,j,k = match
>>> d.a[i:i+k]
'GGCTCCCCACGGGGTACCCATAACTTGACAGTAGATCTC'
>>> d.a[i:i+k] == d.b[j:j+k]
True
>>> d.a == s1
True
>>> d.b == s2
True

网友

3楼 · 编辑于 2024-10-04 05:22:12

Knuth-Morris-Pratt算法是一个很好的方法，可以在另一个字符串中找到一个字符串（因为我看到了DNA，我猜你希望这个可以扩展到。。。十亿？）。

# Knuth-Morris-Pratt string matching
# David Eppstein, UC Irvine, 1 Mar 2002

from __future__ import generators

def KnuthMorrisPratt(text, pattern):

    '''Yields all starting positions of copies of the pattern in the text.
Calling conventions are similar to string.find, but its arguments can be
lists or iterators, not just strings, it returns all matches, not just
the first one, and it does not need the whole text in memory at once.
Whenever it yields, it will have read the text exactly up to and including
the match that caused the yield.'''

    # allow indexing into pattern and protect against change during yield
    pattern = list(pattern)

    # build table of shift amounts
    shifts = [1] * (len(pattern) + 1)
    shift = 1
    for pos in range(len(pattern)):
        while shift <= pos and pattern[pos] != pattern[pos-shift]:
            shift += shifts[pos-shift]
        shifts[pos+1] = shift

    # do the actual search
    startPos = 0
    matchLen = 0
    for c in text:
        while matchLen == len(pattern) or \
              matchLen >= 0 and pattern[matchLen] != c:
            startPos += shifts[matchLen]
            matchLen -= shifts[matchLen]
        matchLen += 1
        if matchLen == len(pattern):
            yield startPos

The link where I got the KMP python code（和一个内置的，由于运行时常量的存在，对于小问题来说更快）。

为了提高性能，可以使用前缀表和字符串的哈希窗口作为基4整数（在生物学中，我们称它们为k-mers或oligos）。；）

祝你好运！

编辑：还有一个很好的技巧，可以对包含第一个字符串中的每个前缀（n total）和第二个字符串中的每个前缀（n total）的列表进行排序。如果它们共享最大的公共子序列，则它们必须在排序列表中相邻，因此从排序列表中最接近的另一个字符串中查找元素，然后获取完全匹配的最长前缀。：）

相关问题更多 >

编程相关推荐

热门问题

热门文章

如何找到两个序列之间的重叠，并返回

相关问题 更多 >

编程相关推荐

热门问题

热门文章

相关问题更多 >