Removing multiple recurring pieces of text from pandas rows

   text
0  which brings not only warmer weather but also the unsettling realization that the year is more than halfway over. So
1  which brings not only warmer weather but also the unsettling realization that the year is more than halfway over. So
2  which brings not only warmer weather but also the unsettling realization that the year is more than halfway over. So
3  which brings not only warmer weather but also the unsettling realization that the year is more than halfway over. So
4  which brings not only warmer weather but also the unsettling realization that the year is more than halfway over. So
5  which brings not only warmer weather but also the unsettling realization that the year is more than halfway over. So
6  which brings not only warmer weather but also the unsettling realization that the year is more than halfway over. So
7  which brings not only warmer weather but also the unsettling realization that the year is more than halfway over. So
8  which brings not only warmer weather but also the unsettling realization that the year is more than halfway over. So for those who werent as productive as they would have liked during the first half of 2018
28  for those who werent as productive as they would have liked during the first half of 2018
29  for those who werent as productive as they would have liked during the first half of 2018
30  for those who werent as productive as they would have liked during the first half of 2018
31  for those who werent as productive as they would have liked during the first half of 2018
32  for those who werent as productive as they would have liked during the first half of 2018

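2 Answers

User

Answer 1 · edited 2024-09-30 04:38:46

If you want to remove strings that are exactly identical, sort the dataframe and then walk through it in order, comparing each row with the next. (This is similar to what 纳德里戈 mentioned in the comments.)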

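sents = ...  # sorted list of sentences, e.g. sorted(df["text"])
out = []  # unique sentences end up here
for ii in range(len(sents)):
    # keep a sentence unless it is identical to its successor;
    # the last sentence has no successor and is always kept
    if ii == len(sents) - 1 or sents[ii] != sents[ii + 1]:
        out.append(sents[ii])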
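
For exact duplicates, sorting is not strictly required. The following is a minimal sketch, assuming the sentences have already been pulled into a plain Python list; the names `drop_exact_duplicates` and `sentences` are made up for illustration. It keeps the first occurrence of each string and preserves the original order:

```python
def drop_exact_duplicates(sentences):
    """Return sentences with exact duplicates removed, keeping first occurrences."""
    # dict keys are unique and keep insertion order (Python 3.7+)
    return list(dict.fromkeys(sentences))


# Hypothetical sample data, not taken from the question
sentences = ["b sentence", "a sentence", "b sentence", "a sentence", "c sentence"]
unique = drop_exact_duplicates(sentences)
print(unique)
```

If the data stays in a DataFrame, `df.drop_duplicates()` serves the same purpose without an explicit loop.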

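If you want to remove sentences that are very similar but not identical, the problem is much harder and there is no simple solution. You will need to look into locality-sensitive hashing or near-duplicate detection. The datasketch library may help.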
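
To give a feel for what near-duplicate detection involves, here is a rough sketch that uses only the standard library. It is not datasketch and not locality-sensitive hashing: it computes the exact Jaccard similarity over word shingles, which is the quantity that MinHash-based tools approximate. The function names, the shingle size `k=3`, and the `0.8` threshold are all assumptions chosen for illustration.

```python
def shingles(text, k=3):
    """Return the set of k-word shingles (overlapping word windows) of text."""
    words = text.lower().split()
    if len(words) < k:
        return {tuple(words)}
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b, k=3):
    """Jaccard similarity of the shingle sets of two strings."""
    sa, sb = shingles(a, k), shingles(b, k)
    return len(sa & sb) / len(sa | sb)


def drop_near_duplicates(sentences, threshold=0.8):
    """Keep a sentence only if it is not too similar to one already kept."""
    kept = []
    for s in sentences:
        if all(jaccard(s, t) < threshold for t in kept):
            kept.append(s)
    return kept


# Hypothetical sample data
docs = [
    "the year is more than halfway over so get to work",
    "the year is more than halfway over so get to work now",
    "for those who were not as productive as they would have liked",
]
kept = drop_near_duplicates(docs)
print(kept)
```

This compares every sentence against every kept sentence, so it is quadratic in the worst case. Avoiding that cost on large inputs is the reason to reach for locality-sensitive hashing.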

Based on your comment, I think I finally understand: you want to remove a common prefix. In that case, modify the code above as follows:

sents = ...  # sorted list of sentences, e.g. sorted(df["text"])
out = []  # cleaned sentences go here
lml = 0  # last match length
for ii in range(len(sents) - 1):
    a, b = sents[ii], sents[ii + 1]
    # first check if the match from the last iteration still works:
    # the first lml characters agree and the next character differs
    if lml > 0 and a[:lml] == b[:lml] and a[:lml + 1] != b[:lml + 1]:
        # old prefix still worked, chop and move on
        out.append(a[lml:])
        continue

    # if we're here, it means the prefix changed
    ml = 0  # match length
    # find the longest matching prefix; the bound check stops the loop
    # when one string runs out (e.g. for identical strings)
    while ml < min(len(a), len(b)) and a[ml] == b[ml]:
        ml += 1

    # save the prefix length
    lml = ml
    # chop off the shared prefix
    out.append(a[ml:])

# the last sentence has no successor; reuse the last prefix length
if sents:
    out.append(sents[-1][lml:])
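
When every sentence in a group is known to share one prefix, the standard library can find it directly. The sketch below is an alternative to the loop above, not the answerer's method: it relies on `os.path.commonprefix`, which compares strings character by character and works on arbitrary strings, not just paths. The helper name and the sample rows are hypothetical.

```python
import os


def strip_common_prefix(sentences):
    """Remove the longest prefix shared by all sentences from each of them."""
    # commonprefix returns '' for an empty list or when nothing is shared
    prefix = os.path.commonprefix(sentences)
    return [s[len(prefix):] for s in sentences]


# Hypothetical sample data: one group sharing a single prefix
rows = [
    "Breaking news: markets rally",
    "Breaking news: storm hits coast",
    "Breaking news: new phone released",
]
cleaned = strip_common_prefix(rows)
print(cleaned)
```

Because the prefix must be shared by all inputs, rows with different prefixes would have to be split into groups first and each group passed separately.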

User
Answer 2 · edited 2024-09-30 04:38:46

I think you can use difflib, for example:
>>> import difflib
>>> a = "my mother always told me to mind my business"
>>> b = "my mother always told me to be polite"
>>> s = difflib.SequenceMatcher(None, a, b)
>>> s.find_longest_match(0, len(a), 0, len(b))
Output:
Match(a=0, b=0, size=28)
This means the match starts at index 0 in both strings and is 28 characters long.
Now if you do this:
>>> b.replace(a[:28],"")
the output will be:
'be polite'
If you assign c = s.find_longest_match(0, len(a), 0, len(b)), then c[0] = 0, c[1] = 0 and c[2] = 28.
You can read more here: https://docs.python.org/2/library/difflib.html
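
The steps above can be wrapped into a small helper. This is a sketch under assumptions: the function name `remove_longest_common_block` is made up, and it cuts the matched block out by slicing instead of calling `replace`, so only the matched occurrence is removed even if the same text appears elsewhere in the string.

```python
import difflib


def remove_longest_common_block(a, b):
    """Return b with the longest block it shares with a cut out."""
    # autojunk=False turns off the heuristic that ignores very frequent
    # characters in long sequences
    matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
    match = matcher.find_longest_match(0, len(a), 0, len(b))
    # match.b is the start of the block in b, match.size its length
    return b[:match.b] + b[match.b + match.size:]


a = "my mother always told me to mind my business"
b = "my mother always told me to be polite"
result = remove_longest_common_block(a, b)
print(result)
```

Note that `find_longest_match` finds the longest common block anywhere in the two strings, not necessarily at the start, so this also removes shared text from the middle or end.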
