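<p>I'm working on a Python script that finds the (longest possible) lengths of all n-word-length substrings shared by two strings, disregarding trailing punctuation. Given two strings:</p>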
<blockquote>
<p>"this is a sample string"</p>
<p>"this is also a sample string"</p>
</blockquote>
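<p>I want the script to identify that these strings share a sequence of 2 words ("this is") followed by a shared sequence of 3 words ("a sample string"). Here is my current approach:</p>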
<pre><code>a = "this is a sample string"
b = "this is also a sample string"
aWords = a.split()
bWords = b.split()
#create counters to keep track of position in string
currentA = 0
currentB = 0
#create counter to keep track of longest sequence of matching words
matchStreak = 0
#create a list that contains all of the matchstreaks found
matchStreakList = []
#create binary switch to control the use of while loop
continueWhileLoop = 1
for word in aWords:
    currentA += 1
    if word == bWords[currentB]:
        matchStreak += 1
        #to avoid index errors, check to make sure we can move forward one unit in the b string before doing so
        if currentB + 1 < len(bWords):
            currentB += 1
        #in case we have two identical strings, check to see if we're at the end of string a. If we are, append value of match streak to list of match streaks
        if currentA == len(aWords):
            matchStreakList.append(matchStreak)
    elif word != bWords[currentB]:
        #because the streak is broken, check to see if the streak is >= 1. If it is, append the streak counter to our list of streaks and then reset the counter
        if matchStreak >= 1:
            matchStreakList.append(matchStreak)
        matchStreak = 0
        while word != bWords[currentB]:
            #the two words don't match. If you can move b forward one word, do so, then check for another match
            if currentB + 1 < len(bWords):
                currentB += 1
            #if you have advanced b all the way to the end of string b, then rewind to the beginning of string b and advance a, looking for more matches
            elif currentB + 1 == len(bWords):
                currentB = 0
                break
        if word == bWords[currentB]:
            matchStreak += 1
            #now that you have a match, check to see if you can advance b. If you can, do so. Else, rewind b to the beginning
            if currentB + 1 < len(bWords):
                currentB += 1
            elif currentB + 1 == len(bWords):
                #we're at the end of string b. If we are also at the end of string a, check to see if the value of matchStreak >= 1. If so, add matchStreak to matchStreakList
                if currentA == len(aWords):
                    matchStreakList.append(matchStreak)
                currentB = 0
                break
print(matchStreakList)
</code></pre>
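<p>This script correctly prints the (maximum) lengths of the common word-length substrings (2, 3), and has done so for all tests so far. My question is: is there a pair of strings for which the approach above will not work? More to the point: are there existing Python libraries or well-known approaches that can be used to find the maximum lengths of all n-word-length substrings that two strings share?</p>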
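<p>[This question is different from the longest common substring problem, which is only a special case of what I'm looking for (because I want to find all common substrings, not just the longest one). <a href="https://stackoverflow.com/questions/1410822/how-can-i-detect-common-substrings-in-a-list-of-strings">This SO post</a> suggests that methods such as 1) cluster analysis, 2) edit distance routines, and 3) longest common sequence algorithms might be suitable approaches, but I didn't find any working solutions there, and my problem is perhaps slightly easier than the one in the link because I'm dealing with words bounded by whitespace.]</p>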
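<p>For reference, one well-known way to approach this is the dynamic-programming table used for the longest common substring problem, applied to word lists instead of characters, keeping every run that cannot be extended rather than only the longest one. The following is a minimal sketch under that assumption; <code>tokenize</code> and <code>common_word_runs</code> are hypothetical helper names, not part of any library, and stripping punctuation from both ends of each word is my own simplification:</p>

```python
import string

def tokenize(text):
    """Lowercase, split on whitespace, strip punctuation from both ends of each word."""
    words = (w.strip(string.punctuation) for w in text.lower().split())
    return [w for w in words if w]

def common_word_runs(first, second):
    """Return the set of shared word runs that cannot be extended where they match."""
    a, b = tokenize(first), tokenize(second)
    # run[i][j] = length of the common run ending at a[i-1] and b[j-1]
    run = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                run[i][j] = run[i - 1][j - 1] + 1
    found = set()
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            n = run[i][j]
            # A run is kept only if the next words in a and b differ (or one string ends)
            extendable = i < len(a) and j < len(b) and a[i] == b[j]
            if n and not extendable:
                found.add(" ".join(a[i - n:i]))
    return found

runs = common_word_runs("this is a sample string", "this is also a sample string")
print(sorted(len(r.split()) for r in runs))  # [2, 3]
```

<p>The table costs time and memory proportional to the product of the two word counts, which should be acceptable for sentence-sized inputs. Note that a short run can be reported alongside a longer run that contains the same words, if it also matches at a different position.</p>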
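<p><strong>Edit:</strong></p>
<p>I'm starting a bounty on this question. In case it helps others, I wanted to quickly clarify a few points. First, the helpful answer given below by @DhruvPathak does not find all maximally long n-word-length substrings shared by two strings. For example, suppose the two strings we are analyzing are:</p>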
<blockquote>
<p>"They all are white a sheet of spotless paper when they first are born
but they are to be scrawled upon and blotted by every goose quill"</p>
</blockquote>
<p>and</p>
<blockquote>
<p>"You are all white, a sheet of lovely, spotless paper, when you first
are born; but you are to be scrawled and blotted by every goose's
quill"</p>
</blockquote>
<p>In this case, the list of the longest n-word-length substrings (disregarding trailing punctuation) is:</p>
^{pr2}$
<p>Using the following routine:</p>
<pre><code>#import required packages
import difflib
#define function we'll use to identify matches
def matches(first_string, second_string):
    s = difflib.SequenceMatcher(None, first_string, second_string)
    match = [first_string[i:i+n] for i, j, n in s.get_matching_blocks() if n > 0]
    return match
a = "They all are white a sheet of spotless paper when they first are born but they are to be scrawled upon and blotted by every goose quill"
b = "You are all white, a sheet of lovely, spotless paper, when you first are born; but you are to be scrawled and blotted by every goose's quill"
a = a.replace(",", "").replace(":","").replace("!","").replace("'","").replace(";","").lower()
b = b.replace(",", "").replace(":","").replace("!","").replace("'","").replace(";","").lower()
print(matches(a, b))
</code></pre>
<p>one gets the output:</p>
<pre><code>['e', ' all', ' white a sheet of', ' spotless paper when ', 'y', ' first are born but ', 'y', ' are to be scrawled', ' and blotted by every goose', ' quill']
</code></pre>
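<p>First, I'm not sure how one could select from this list only the substrings that consist of whole words. Second, this list does not include "are", one of the desired longest common n-word-length substrings. Is there a method that finds all of the longest n-word-length substrings shared by these two strings ("You are all..." and "They all are...")?</p>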
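<p>On the first point, one option that seems worth trying is to hand <code>difflib.SequenceMatcher</code> lists of words rather than raw strings, so every matching block is made of whole words by construction. A rough sketch, where <code>tokenize</code> and <code>word_matches</code> are hypothetical helper names and the punctuation handling is an assumption on my part:</p>

```python
import difflib
import string

def tokenize(text):
    """Lowercase, split on whitespace, strip punctuation from both ends of each word."""
    words = (w.strip(string.punctuation) for w in text.lower().split())
    return [w for w in words if w]

def word_matches(first_string, second_string):
    """Return the matching blocks of two strings, compared word by word."""
    a_words = tokenize(first_string)
    b_words = tokenize(second_string)
    # SequenceMatcher accepts any sequences of hashable items, including lists of words
    s = difflib.SequenceMatcher(None, a_words, b_words, autojunk=False)
    return [" ".join(a_words[i:i + n])
            for i, j, n in s.get_matching_blocks() if n > 0]

a = "They all are white a sheet of spotless paper when they first are born but they are to be scrawled upon and blotted by every goose quill"
b = "You are all white, a sheet of lovely, spotless paper, when you first are born; but you are to be scrawled and blotted by every goose's quill"
print(word_matches(a, b))
```

<p>This only addresses the whole-word issue. Because <code>get_matching_blocks()</code> returns a single in-order alignment of the two sequences, a standalone "are" may still be missing from the result, so the second point remains open with this approach.</p>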