找到两个字符串共同拥有的所有长度为n的子字符串的最大长度问题的回答

找到两个字符串共同拥有的所有长度为n的子字符串的最大长度

回答此问题可获得 20 贡献值，回答如果被采纳可获得 50 分。

我正在开发一个Python脚本，它可以找到两个字符串共享的所有n个单词长度的子字符串的长度（可能最长），而不考虑后面的标点符号。给出两个字符串： <blockquote> "this is a sample string" "this is also a sample string" </blockquote> 我想让脚本标识这些字符串有一个共有2个单词的序列（“this is”），后跟一个3个单词的序列（“示例字符串”）。以下是我目前的方法： <pre><code>a = "this is a sample string" b = "this is also a sample string" aWords = a.split() bWords = b.split() #create counters to keep track of position in string currentA = 0 currentB = 0 #create counter to keep track of longest sequence of matching words matchStreak = 0 #create a list that contains all of the matchstreaks found matchStreakList = [] #create binary switch to control the use of while loop continueWhileLoop = 1 for word in aWords: currentA += 1 if word == bWords[currentB]: matchStreak += 1 #to avoid index errors, check to make sure we can move forward one unit in the b string before doing so if currentB + 1 < len(bWords): currentB += 1 #in case we have two identical strings, check to see if we're at the end of string a. If we are, <a href="https://www.cnpython.com/list/append" class="inner-link">append</a> value of match streak to list of match streaks if currentA == len(aWords): matchStreakList.append(matchStreak) elif word != bWords[currentB]: #because the streak is broken, check to see if the streak is >= 1. If it is, append the streak counter to out list of streaks and then reset the counter if matchStreak >= 1: matchStreakList.append(matchStreak) matchStreak = 0 while word != bWords[currentB]: #the two words don't match. If you can move b forward one word, do so, then check for another match if currentB + 1 < len(bWords): currentB += 1 #if you have advanced b all the way to the end of string b, then rewind to the beginning of string b and advance a, looking for more matches elif currentB + 1 == len(bWords): currentB = 0 break if word == bWords[currentB]: matchStreak += 1 #now that you have a match, check to see if you can advance b. If you can, do so. Else, rewind b to the beginning if currentB + 1 < len(bWords): currentB += 1 elif currentB + 1 == len(bWords): #we're at the end of string b. If we are also at the end of string a, check to see if the value of matchStreak >= 1. If so, add matchStreak to matchStreakList if currentA == len(aWords): matchStreakList.append(matchStreak) currentB = 0 break print matchStreakList </code></pre> 这个脚本正确地输出了公共字长子串（2，3）的（最大）长度，目前为止所有的测试都是这样。我的问题是：有没有一对两个字符串，上面的方法不起作用？更重要的是：有没有现存的Python库或著名的方法可以用来找到两个字符串共享的所有n个字长的子字符串的最大长度？在 [这个问题不同于最长公共子串问题，它只是我所要寻找的东西的一个特例（因为我想要找到所有的公共子串，而不仅仅是最长的公共子串）。<a href="https://stackoverflow.com/questions/1410822/how-can-i-detect-common-substrings-in-a-list-of-strings">This SO post</a>建议1）聚类分析、2）编辑距离例程和3）最长公共序列算法等方法可能是合适的方法，但我没有找到任何有效的解决方案，而且我的问题可能比链接中提到的稍微简单一些，因为我处理的是空格限制的单词。] 编辑： 我开始悬赏这个问题。如果这会对其他人有所帮助，我想快速澄清几点。首先，@DhruvPathak下面给出的有用答案并没有找到两个字符串共享的所有最大长度为n个单词的子字符串。例如，假设我们分析的两个字符串是： <blockquote> "They all are white a sheet of spotless paper when they first are born but they are to be scrawled upon and blotted by every goose quill" </blockquote> 以及 <blockquote> "You are all white, a sheet of lovely, spotless paper, when you first are born; but you are to be scrawled and blotted by every goose's quill" </blockquote> 在这种情况下，最长的n个单词长度的子字符串列表（忽略后面的标点符号）是： ^{pr2}$ 使用以下例程： <pre><code>#import required packages import difflib #define function we'll use to identify matches def matches(first_string,second_string): s = difflib.SequenceMatcher(None, first_string,second_string) match = [first_string[i:i+n] for i, j, n in s.get_matching_blocks() if n > 0] return match a = "They all are white a sheet of spotless paper when they first are born but they are to be scrawled upon and blotted by every goose quill" b = "You are all white, a sheet of lovely, spotless paper, when you first are born; but you are to be scrawled and blotted by every goose's quill" a = a.replace(",", "").replace(":","").replace("!","").replace("'","").replace(";","").lower() b = b.replace(",", "").replace(":","").replace("!","").replace("'","").replace(";","").lower() print matches(a,b) </code></pre> 一个得到输出： <pre><code>['e', ' all', ' white a sheet of', ' spotless paper when ', 'y', ' first are born but ', 'y', ' are to be scrawled', ' and blotted by every goose', ' quill'] </code></pre> 首先，我不确定如何从这个列表中选择只包含整词的子字符串。第二，该列表不包括“are”，即期望的最长公共n字长度子串之一。有没有一种方法可以找到这两个字符串共享的所有最长n字长的子字符串（“You all…”和“They all…”）？在

0 条评论
分类：Python问答

默认排序时间排序

1 个回答

匿名 1天前

　擅长：python、mysql、java

这里仍然有模棱两可的地方，我不想花时间去争论它们。但我想我可以补充一些有用的东西；-） 我编写了Python的<code>difflib.SequenceMatcher</code>，并花了大量的时间寻找预期的快速查找最长公共子串的方法。理论上，这应该用“后缀树”来完成，或者用“最长公共前缀数组”来扩充相关的“后缀数组”（如果你想用Google搜索更多的话，引号中的短语就是搜索词）。这些方法可以在最坏情况下解决线性时间问题。但是，就像有时的情况一样，最坏情况下的线性时间算法非常复杂和微妙，并且受到很大的常量因素的影响——如果给定的语料库要被搜索多次，它们仍然可以获得巨大的回报，但这不是Python的<code>difflib</code>的典型情况，而且看起来也不像你的情况。在 总之，我在这里的贡献是重写<code>SequenceMatcher</code>的<code>find_longest_match()</code>方法，以返回所有它在过程中找到的（本地）最大匹配。注意事项： <ol> <li>我将使用raymondhettinger给您的<code>to_words()</code>函数，但是没有转换成小写。转换为小写会导致输出与您所说的不完全相同。</li> <li>然而，正如我已经在一篇评论中提到的，这确实输出了“羽毛笔”，它不在您想要的输出列表中。我不知道为什么不是，因为“羽毛笔”确实出现在两个输入中。</li> </ol> 代码如下： <pre><code>import re def to_words(text): 'Break text into a list of words without punctuation' return re.findall(r"[a-zA-Z']+", text) def match(a, b): # Make b the longer list. if len(a) > len(b): a, b = b, a # Map each word of b to a list of indices it occupies. b2j = {} for j, word in enumerate(b): b2j.setdefault(word, []).append(j) j2len = {} nothing = [] unique = set() # set of all results def local_max_at_j(j): # maximum match ends with b[j], with length j2len[j] length = j2len[j] unique.add(" ".join(b[j-length+1: j+1])) # during an iteration of the loop, j2len[j] = length of longest # match ending with b[j] and the previous word in a for word in a: # look at all instances of word in b j2lenget = j2len.get newj2len = {} for j in b2j.get(word, nothing): newj2len[j] = j2lenget(j-1, 0) + 1 # which indices have not been extended? those are # (local) maximums for j in j2len: if j+1 not in newj2len: local_max_at_j(j) j2len = newj2len # and we may also have local maximums ending at the last word for j in j2len: local_max_at_j(j) return unique </code></pre> 然后： ^{pr2}$ 显示： <pre><code>set(['all', 'and blotted by every', 'first are born but', 'are to be scrawled', 'are', 'spotless paper when', 'white a sheet of', 'quill']) </code></pre> 编辑-工作原理 许多序列匹配和比对算法最好理解为在二维矩阵上工作，其中包含计算矩阵项的规则，并随后解释这些项的含义。在 对于输入序列<code>a</code>和<code>b</code>，用<code>len(a)</code>行和<code>len(b)</code>列描绘一个矩阵<code>M</code>。在这个应用程序中，我们希望<code>M[i, j]</code>包含以<code>a[i]</code>和<code>b[j]</code>结尾的最长公共连续子序列的长度，并且计算规则非常： <ol> <li><code>M[i, j] = 0</code>如果<code>a[i] != b[j]</code>。在</li> <li><code>M[i, j] = M[i-1, j-1] + 1</code>如果<code>a[i] == b[j]</code>（我们假设一个越界矩阵引用静默返回0）。在</li> </ol> 在这种情况下，解释也非常容易：存在一个以<code>a[i]</code>和<code>b[j]</code>结尾的局部最大匹配，当且仅当<code>M[i, j]</code>非零，但<code>M[i+1, j+1]</code>为0或超出边界。在 您可以使用这些规则编写非常简单和紧凑的代码，有两个循环，可以正确地计算出这个问题的<code>M</code>。缺点是代码将占用（最佳、平均和最坏情况）<code>O(len(a) * len(b))</code>时间和空间。在 虽然一开始可能会让人困惑，但我发布的代码正是实现了上述目的。由于代码在许多方面针对预期情况进行了大量优化，因此连接变得模糊： <ul> <li>不需要执行一个过程来计算<code>M</code>，然后再进行另一个过程来解释结果，而是在<code>a</code>上交叉进行计算和解释。</li> <li>因此，不需要存储整个矩阵。相反，只有当前行（<code>newj2len</code>）和前一行（<code>j2len</code>）同时存在。</li> <li>由于这个问题中的矩阵通常是零，所以这里的一行通过dict映射列索引到非零值来稀疏表示。零项是“免费的”，因为它们从来没有显式地存储。</li> <li>在处理一行时，不需要遍历每一列：预计算的<code>b2j</code>dict准确地告诉我们当前行中有趣的列索引（那些与<code>a</code>中的当前<code>word</code>匹配的列）。</li> <li>最后，部分是偶然的，所有之前的优化都是以这样一种方式进行的：从来没有需要知道当前行的索引，所以我们也不必费心计算它。</li> </ul> 编辑-简单版 下面的代码直接实现2D矩阵，不需要尝试优化（除了<code>Counter</code>通常可以避免显式地存储0个条目）。它非常简单、简短和简单： <pre><code>def match(a, b): from collections import Counter M = Counter() for i in range(len(a)): for j in range(len(b)): if a[i] == b[j]: M[i, j] = M[i-1, j-1] + 1 unique = set() for i in range(len(a)): for j in range(len(b)): if M[i, j] and not M[i+1, j+1]: length = M[i, j] unique.add(" ".join(a[i+1-length: i+1])) return unique </code></pre> 当然；-）返回的结果与我最初发布的优化的<code>match()</code>相同。在 编辑-另一个没有dict的 只是为了好玩：-）如果你有矩阵模型，这段代码将很容易理解。关于这个特殊的问题，一个矩阵单元的值只依赖于沿着单元格西北方向的对角线的值。所以只要穿过所有的主对角线，从西边界和北边界的所有单元格向东南方向移动就足够了。这样，无论输入的长度如何，只需要很小的恒定空间： <pre><code>def match(a, b): from itertools import chain m, n = len(a), len(b) unique = set() for i, j in chain(((i, 0) for i in xrange(m)), ((0, j) for j in xrange(1, n))): k = 0 while i < m and j < n: if a[i] == b[j]: k += 1 elif k: unique.add(" ".join(a[i-k: i])) k = 0 i += 1 j += 1 if k: unique.add(" ".join(a[i-k: i])) return unique </code></pre>

找到两个字符串共同拥有的所有长度为n的子字符串的最大长度

1 个回答

相关Python问题