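<p>To remove exact duplicates, sort the dataframe and then walk through it in order, comparing each entry with the next. (This is similar to what 纳德里戈 mentioned in the comments.)</p>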
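<pre><code>sents = ...  # sorted sentences (e.g. a list, or a Series with a default integer index)
out = []  # unique sentences end up here
for ii in range(len(sents) - 1):
    # keep a sentence only when it differs from its successor
    if sents[ii] != sents[ii + 1]:
        out.append(sents[ii])
if len(sents):
    # the last sentence has no successor, so it is always kept
    out.append(sents[len(sents) - 1])
</code></pre>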
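<p>For reference, the same sort-then-compare idea can be sketched with the standard library. This is only an illustration: <code>unique_sorted</code> is a hypothetical helper name, and it assumes the sentences are plain strings in an ordinary iterable.</p>

```python
from itertools import groupby


def unique_sorted(sents):
    """Return the distinct sentences from an iterable, in sorted order."""
    # groupby collapses each run of equal neighbours into a single key,
    # which is why the input has to be sorted first
    return [key for key, _ in groupby(sorted(sents))]


print(unique_sorted(["b", "a", "b", "c", "a"]))
```

<p>If the sentences live in a pandas Series, <code>Series.drop_duplicates()</code> gives the same result without sorting first.</p>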
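<p>If you want to remove sentences that are very similar but not identical, the problem is much harder and has no simple solution. You will need to look into <strong>locality-sensitive hashing</strong> or <strong>near-duplicate detection</strong>. The <a href="https://ekzhu.github.io/datasketch/" rel="nofollow noreferrer">datasketch</a> library may help.</p>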
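<p>To give a feel for what near-duplicate detection measures, here is a minimal brute-force sketch using Jaccard similarity over word shingles. It is not locality-sensitive hashing and does not use datasketch: it compares every sentence against every kept sentence, so it only suits small inputs. The function names, the shingle size <code>k=3</code> and the <code>threshold=0.5</code> cutoff are all assumptions chosen for illustration.</p>

```python
def shingles(text, k=3):
    """Return the set of k-word shingles of a sentence."""
    words = text.lower().split()
    if len(words) < k:
        # too short to shingle: treat the whole sentence as one shingle
        return {" ".join(words)}
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets: size of intersection over size of union."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def drop_near_duplicates(sents, threshold=0.5, k=3):
    """Keep a sentence only if it is not too similar to one already kept."""
    kept = []  # (sentence, shingle set) pairs
    for sent in sents:
        sh = shingles(sent, k)
        # quadratic scan: this is the step that LSH is meant to avoid
        if all(jaccard(sh, other) < threshold for _, other in kept):
            kept.append((sent, sh))
    return [sent for sent, _ in kept]
```

<p>With the assumed settings, two sentences that differ only in their final word are treated as near-duplicates and only the first is kept. The quadratic scan is the part that MinHash-based LSH replaces with an approximate index for larger collections.</p>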
<hr/>
<p>Based on your comment, I think I finally understand: you want to remove a <strong>common prefix</strong>. In that case, modify the code above as follows:</p>
<pre><code>sents = ...  # sorted sentences (e.g. a list, or a Series with a default integer index)
out = []  # cleaned sentences go here
lml = 0  # last match length
for ii in range(len(sents) - 1):
    cur, nxt = sents[ii], sents[ii + 1]
    # first check whether the match length from the last iteration still applies
    if cur[:lml] == nxt[:lml] and cur[:lml + 1] != nxt[:lml + 1]:
        # old prefix still works, chop and move on
        out.append(cur[lml:])
        continue
    # if we're here, the prefix changed
    ml = 0  # match length
    # find the longest matching prefix; the length bound keeps
    # identical strings from looping forever
    while ml != min(len(cur), len(nxt)) and cur[ml] == nxt[ml]:
        ml += 1
    # save the prefix length
    lml = ml
    # chop off the shared prefix
    out.append(cur[ml:])
if len(sents):
    # the last sentence has no successor, so reuse the last match length
    out.append(sents[len(sents) - 1][lml:])
</code></pre>
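<p>One thing to be aware of: because each sentence is compared only with the one after it, the last sentence of a group that shares a prefix is compared with the first sentence of the next group and so keeps its prefix. The sketch below is one possible variant, assuming the goal is to strip the longest prefix shared with <em>either</em> sorted neighbour. <code>strip_shared_prefixes</code> is a hypothetical name, and <code>os.path.commonprefix</code> is used only because it compares strings character by character.</p>

```python
from os.path import commonprefix


def strip_shared_prefixes(sents):
    """Strip from each sentence the longest prefix it shares with a sorted neighbour."""
    sents = sorted(sents)
    out = []
    for ii, cur in enumerate(sents):
        # neighbours in sorted order; the first and last items have only one
        neighbours = sents[max(ii - 1, 0):ii] + sents[ii + 1:ii + 2]
        # length of the longest prefix shared with either neighbour
        cut = max((len(commonprefix([cur, other])) for other in neighbours), default=0)
        out.append(cur[cut:])
    return out


print(strip_shared_prefixes(["report: alpha", "report: beta", "summary x"]))
# ['alpha', 'beta', 'summary x']
```

<p>Note that this variant also reduces exact duplicates to empty strings, so remove those first if that is not what you want.</p>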