Find all n-word-length substrings of maximal length shared by two strings



I'm working on a Python script that finds the lengths of all maximally long n-word substrings shared by two strings, disregarding trailing punctuation. Given two strings: <blockquote> "this is a sample string" "this is also a sample string" </blockquote> I want the script to identify that these strings share a two-word sequence ("this is") followed by a three-word sequence ("a sample string"). Here is my current approach:
<pre><code>a = "this is a sample string"
b = "this is also a sample string"

aWords = a.split()
bWords = b.split()

#create counters to keep track of position in string
currentA = 0
currentB = 0

#create counter to keep track of longest sequence of matching words
matchStreak = 0

#create a list that contains all of the matchstreaks found
matchStreakList = []

#create binary switch to control the use of while loop
continueWhileLoop = 1

for word in aWords:
    currentA += 1

    if word == bWords[currentB]:
        matchStreak += 1

        #to avoid index errors, check to make sure we can move forward one unit in the b string before doing so
        if currentB + 1 < len(bWords):
            currentB += 1

        #in case we have two identical strings, check to see if we're at the end of string a. If we are, append value of match streak to list of match streaks
        if currentA == len(aWords):
            matchStreakList.append(matchStreak)

    elif word != bWords[currentB]:

        #because the streak is broken, check to see if the streak is >= 1. If it is, append the streak counter to our list of streaks and then reset the counter
        if matchStreak >= 1:
            matchStreakList.append(matchStreak)
        matchStreak = 0

        while word != bWords[currentB]:

            #the two words don't match. If you can move b forward one word, do so, then check for another match
            if currentB + 1 < len(bWords):
                currentB += 1

            #if you have advanced b all the way to the end of string b, then rewind to the beginning of string b and advance a, looking for more matches
            elif currentB + 1 == len(bWords):
                currentB = 0
                break

        if word == bWords[currentB]:
            matchStreak += 1

            #now that you have a match, check to see if you can advance b. If you can, do so. Else, rewind b to the beginning
            if currentB + 1 < len(bWords):
                currentB += 1
            elif currentB + 1 == len(bWords):

                #we're at the end of string b. If we are also at the end of string a, check to see if the value of matchStreak >= 1. If so, add matchStreak to matchStreakList
                if currentA == len(aWords):
                    matchStreakList.append(matchStreak)
                currentB = 0
                break

print(matchStreakList)
</code></pre>
This script correctly outputs the (maximal) lengths of the common word-length substrings (2, 3), and it has done so for every test so far. My question is: is there a pair of strings for which the approach above will not work? More to the point: is there an existing Python library or a well-known method for finding the maximal lengths of all n-word substrings that two strings share?

[This question differs from the longest common substring problem, which is only a special case of what I'm looking for (I want to find all common substrings, not just the longest one). <a href="https://stackoverflow.com/questions/1410822/how-can-i-detect-common-substrings-in-a-list-of-strings">This SO post</a> suggests that methods such as 1) cluster analysis, 2) edit distance routines, and 3) longest common sequence algorithms might be suitable, but I haven't found a working solution, and my problem is perhaps slightly simpler than the one in the link because I'm dealing with words delimited by whitespace.]

EDIT:

I'm starting a bounty on this question. In case it helps others, I want to clarify a few points. First, the helpful answer given below by @DhruvPathak does not find all maximally long n-word substrings shared by two strings. For example, suppose the two strings we are analyzing are: <blockquote> "They all are white a sheet of spotless paper when they first are born but they are to be scrawled upon and blotted by every goose quill" </blockquote> and <blockquote> "You are all white, a sheet of lovely, spotless paper, when you first are born; but you are to be scrawled and blotted by every goose's quill" </blockquote> In this case, the list of maximally long n-word substrings (disregarding trailing punctuation) is: ^{pr2}$ Using the following routine:
<pre><code>#import required packages
import difflib

#define function we'll use to identify matches
def matches(first_string, second_string):
    s = difflib.SequenceMatcher(None, first_string, second_string)
    match = [first_string[i:i+n] for i, j, n in s.get_matching_blocks() if n > 0]
    return match

a = "They all are white a sheet of spotless paper when they first are born but they are to be scrawled upon and blotted by every goose quill"
b = "You are all white, a sheet of lovely, spotless paper, when you first are born; but you are to be scrawled and blotted by every goose's quill"

#strip punctuation and lowercase both strings before comparing them
a = a.replace(",", "").replace(":","").replace("!","").replace("'","").replace(";","").lower()
b = b.replace(",", "").replace(":","").replace("!","").replace("'","").replace(";","").lower()

print(matches(a, b))
</code></pre>
One gets this output:
<pre><code>['e', ' all', ' white a sheet of', ' spotless paper when ', 'y', ' first are born but ', 'y', ' are to be scrawled', ' and blotted by every goose', ' quill']
</code></pre>
First, I'm not sure how to select from this list only the substrings that consist of whole words. Second, the list does not include "are", one of the desired maximally long common n-word substrings. Is there a way to find all of the maximally long n-word substrings shared by these two strings ("You are all…" and "They all are…")?
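For comparison, here is a minimal sketch of one well-known technique that fits the stated goal: the dynamic-programming table used for the longest common substring problem, applied to word lists instead of characters and modified to record every run that cannot be extended in either direction, not just the longest. This is not the approach from the question or from any answer. The helper names `tokenize` and `maximal_common_word_runs` are hypothetical, the code assumes Python 3, and stripping all ASCII punctuation is an assumption about what "disregarding punctuation" should mean.

```python
import string


def tokenize(text):
    """Lowercase the text, drop ASCII punctuation, and split on whitespace."""
    table = str.maketrans("", "", string.punctuation)
    return text.lower().translate(table).split()


def maximal_common_word_runs(a_words, b_words):
    """Return (start_in_a, start_in_b, length) for every common run of words
    that cannot be extended to the left or to the right."""
    runs = []
    # prev[j] = length of the common run ending at a_words[i-2], b_words[j-1]
    prev = [0] * (len(b_words) + 1)
    for i in range(1, len(a_words) + 1):
        curr = [0] * (len(b_words) + 1)
        for j in range(1, len(b_words) + 1):
            if a_words[i - 1] != b_words[j - 1]:
                continue
            # Extending the diagonal makes the run left-maximal by construction.
            curr[j] = prev[j - 1] + 1
            # The run is right-maximal if the next pair of words is missing
            # or does not match.
            at_end = (
                i == len(a_words)
                or j == len(b_words)
                or a_words[i] != b_words[j]
            )
            if at_end:
                n = curr[j]
                runs.append((i - n, j - n, n))
        prev = curr
    return runs


a = tokenize("this is a sample string")
b = tokenize("this is also a sample string")
runs = maximal_common_word_runs(a, b)
print(runs)  # [(0, 0, 2), (2, 3, 3)]
print([" ".join(a[s:s + n]) for s, _, n in runs])  # ['this is', 'a sample string']
```

Because every pair of positions is compared, matches that appear in a different order in the two strings (such as "all" and "are" in the longer example) are reported too, which `get_matching_blocks()` does not do. A word that occurs several times in both strings yields one entry per pair of positions, so a caller may want to deduplicate or filter by a minimum length. The cost is O(len(a_words) × len(b_words)) word comparisons.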


