Removing multiple repeated text fragments from pandas rows

Posted 2024-09-30 04:38:46


I have a pandas DataFrame whose rows are articles scraped from a website. I have about 100,000 such articles.

Here is a glimpse of my dataset.

text
0   which brings not only warmer weather but also the unsettling realization that the year is more than halfway over. So
1   which brings not only warmer weather but also the unsettling realization that the year is more than halfway over. So
2   which brings not only warmer weather but also the unsettling realization that the year is more than halfway over. So
3   which brings not only warmer weather but also the unsettling realization that the year is more than halfway over. So
4   which brings not only warmer weather but also the unsettling realization that the year is more than halfway over. So
5   which brings not only warmer weather but also the unsettling realization that the year is more than halfway over. So
6   which brings not only warmer weather but also the unsettling realization that the year is more than halfway over. So
7   which brings not only warmer weather but also the unsettling realization that the year is more than halfway over. So
8   which brings not only warmer weather but also the unsettling realization that the year is more than halfway over. So
for those who werent as productive as they would have liked during the first half of 2018
28  for those who werent as productive as they would have liked during the first half of 2018
29  for those who werent as productive as they would have liked during the first half of 2018
30  for those who werent as productive as they would have liked during the first half of 2018
31  for those who werent as productive as they would have liked during the first half of 2018
32  for those who werent as productive as they would have liked during the first half of 2018

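These are the opening words of each text, and they are repeated across rows. The main body of each article follows them.

Is there a method or function that can identify these repeated texts and strip them out in a few lines of code?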


2 answers

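If you want to remove strings that are exactly identical, sort the DataFrame and then compare neighbouring rows in order. (This is similar to what 纳德里戈 mentioned in the comments.)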

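sents = ... # sorted list of strings, e.g. sorted(df["text"])
out = [] # unique strings end up here
for ii in range(len(sents)):
    # keep a string only if it differs from its successor;
    # the last string has no successor and is always kept
    if ii == len(sents) - 1 or sents[ii] != sents[ii + 1]:
        out.append(sents[ii])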
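
As an alternative to the sort-and-compare loop above, here is a minimal sketch that drops exact duplicates while keeping the original order. The helper name `drop_exact_duplicates` and the sample rows are made up for illustration; the only assumption is that the texts are plain Python strings.

```python
def drop_exact_duplicates(texts):
    """Return texts without exact duplicates, keeping first occurrences in order."""
    # dict keys are unique and preserve insertion order (Python 3.7+)
    return list(dict.fromkeys(texts))


# hypothetical sample rows, not taken from the question's dataset
rows = ["a b c", "a b c", "x y", "a b c", "x y z"]
unique_rows = drop_exact_duplicates(rows)
print(unique_rows)
```

If the texts live in a DataFrame column named `text`, `df.drop_duplicates(subset="text")` should do the same job without leaving pandas.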

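If you want to remove sentences that are very similar but not identical, the problem is much harder and there is no simple solution. You will need to look into locality-sensitive hashing for near-duplicate detection; the datasketch library may help.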
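
To make "very similar" concrete, the sketch below computes the Jaccard similarity of character shingles, which is the quantity that MinHash-style locality-sensitive hashing approximates. It is a brute-force illustration only, not the datasketch API: the function names, the shingle size `k=5`, and the `0.8` threshold are all assumptions chosen for the example.

```python
def shingles(text, k=5):
    """Set of all k-character substrings of text (the whole text if it is shorter)."""
    return {text[i:i + k] for i in range(max(len(text) - k + 1, 1))}


def jaccard(a, b, k=5):
    """Jaccard similarity of the shingle sets of a and b, between 0.0 and 1.0."""
    sa, sb = shingles(a, k), shingles(b, k)
    return len(sa & sb) / len(sa | sb)


def drop_near_duplicates(texts, threshold=0.8):
    """Keep a text only if it is below the threshold against every text kept so far."""
    kept = []
    for t in texts:
        if all(jaccard(t, other) < threshold for other in kept):
            kept.append(t)
    return kept


# hypothetical sample texts
samples = [
    "the year is more than halfway over",
    "the year is more than halfway over!",
    "completely different sentence here",
]
print(drop_near_duplicates(samples))
```

This compares every text against every kept text, so it is quadratic and unlikely to be practical for 100,000 articles; that cost is the reason to reach for locality-sensitive hashing at that scale.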


Based on your comment, I think I finally understand: you want to remove a common prefix. In that case, modify the code above as follows:

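sents = ... # sorted list of strings, e.g. sorted(df["text"])
out = [] # cleaned sentences go here
lml = 0 # last match length (0 means no prefix is known yet)
for ii in range(len(sents) - 1):
    a, b = sents[ii], sents[ii + 1]
    # first check if the match from the last iteration still works
    if lml and a[:lml] == b[:lml] and a[:lml + 1] != b[:lml + 1]:
        # old prefix still worked, chop and move on
        out.append(a[lml:])
        continue

    # if we're here, it means the prefix changed
    ml = 0 # match length
    # find the longest matching prefix; the bounds check keeps
    # identical strings from looping forever
    while ml < min(len(a), len(b)) and a[ml] == b[ml]:
        ml += 1

    # save the prefix length
    lml = ml
    # chop off the shared prefix (note: the last sentence of a group shares
    # nothing with the next group's first sentence, so it is left unstripped)
    out.append(a[ml:])

# the final sentence has no successor; strip the prefix it shares with its predecessor
if sents:
    out.append(sents[-1][lml:])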
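
A different way to get at the same idea is to let the standard library find the prefix. The sketch below groups sorted texts by their first few characters and removes each group's longest common prefix with `os.path.commonprefix`, which works character by character on any strings. The function name `strip_group_prefixes` and the `key_len=10` grouping heuristic are assumptions for illustration; how to group real articles depends on the data.

```python
import os
from itertools import groupby


def strip_group_prefixes(texts, key_len=10):
    """Group sorted texts by their first key_len characters (an assumed
    heuristic), then remove each group's longest common prefix."""
    cleaned = []
    for _, group in groupby(sorted(texts), key=lambda t: t[:key_len]):
        group = list(group)
        if len(group) == 1:
            # nothing to compare against, so leave a lone text untouched
            cleaned.extend(group)
            continue
        prefix = os.path.commonprefix(group)
        cleaned.extend(t[len(prefix):] for t in group)
    return cleaned


# hypothetical sample texts
articles = [
    "which brings warmer weather. Alpha",
    "which brings warmer weather. Beta",
    "for those who were late. Gamma",
    "for those who were late. Delta",
    "unrelated",
]
print(strip_group_prefixes(articles))
```

The result comes back in sorted order rather than the original row order, so keep the original index alongside each text if the order matters.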

I think you can use difflib, for example:

>>> import difflib
>>> a = "my mother always told me to mind my business" 
>>> b = "my mother always told me to be polite"
>>> s = difflib.SequenceMatcher(None,a,b)
>>> s.find_longest_match(0,len(a),0,len(b))

Output:

Match(a=0, b=0, size=28)

This means the longest match starts at index 0 in both strings and is 28 characters long.

Now if you do:

>>> b.replace(a[:28],"")

the output will be:

'be polite'

If you assign c = s.find_longest_match(0,len(a),0,len(b)), then c[0] = 0, c[1] = 0, and c[2] = 28.

You can read more here: https://docs.python.org/2/library/difflib.html
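
Putting the difflib steps together, here is a minimal sketch that strips the shared opening from two strings only when the longest match really starts at index 0 in both. The helper name `strip_common_start` is made up for illustration. Slicing is used instead of `str.replace`, because `replace` would also delete any later occurrence of the same text; `autojunk=False` turns off difflib's heuristic that ignores very frequent characters in sequences of 200 or more items, which long articles would trigger.

```python
import difflib


def strip_common_start(a, b):
    """Remove the shared opening of a and b if their longest match starts at index 0 in both."""
    matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
    match = matcher.find_longest_match(0, len(a), 0, len(b))
    if match.a == 0 and match.b == 0:
        # slice instead of str.replace so later repeats of the text survive
        return a[match.size:], b[match.size:]
    # the longest shared block is not at the start, so leave both untouched
    return a, b


a = "my mother always told me to mind my business"
b = "my mother always told me to be polite"
print(strip_common_start(a, b))
```

Because only the single longest match is inspected, a short common prefix is missed whenever a longer shared block appears later in the strings.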
