I have a pandas DataFrame whose rows are articles scraped from a website. There are about 100,000 such articles.
Here is a glimpse of my dataset:
text
0 which brings not only warmer weather but also the unsettling realization that the year is more than halfway over. So
1 which brings not only warmer weather but also the unsettling realization that the year is more than halfway over. So
2 which brings not only warmer weather but also the unsettling realization that the year is more than halfway over. So
3 which brings not only warmer weather but also the unsettling realization that the year is more than halfway over. So
4 which brings not only warmer weather but also the unsettling realization that the year is more than halfway over. So
5 which brings not only warmer weather but also the unsettling realization that the year is more than halfway over. So
6 which brings not only warmer weather but also the unsettling realization that the year is more than halfway over. So
7 which brings not only warmer weather but also the unsettling realization that the year is more than halfway over. So
8 which brings not only warmer weather but also the unsettling realization that the year is more than halfway over. So
for those who werent as productive as they would have liked during the first half of 2018
28 for those who werent as productive as they would have liked during the first half of 2018
29 for those who werent as productive as they would have liked during the first half of 2018
30 for those who werent as productive as they would have liked during the first half of 2018
31 for those who werent as productive as they would have liked during the first half of 2018
32 for those who werent as productive as they would have liked during the first half of 2018
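These are the opening words of each text, and they are repeated across rows. The main body of each article follows this repeated lead.
Is there a method or function that can identify these repeated leads and strip them out in a few lines of code?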
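If you want to remove strings that are exactly identical, sort the DataFrame and then walk through it in order, comparing each row with the one before it. (This is similar to what 纳德里戈 mentioned in the comments.)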
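A minimal sketch of the sort-then-scan idea, written against a plain list of strings rather than a DataFrame. The function name `drop_exact_duplicates` is hypothetical and only serves the illustration.

```python
def drop_exact_duplicates(texts):
    """Return the distinct strings in `texts`, in sorted order.

    Hypothetical helper: after sorting, identical strings sit next to
    each other, so comparing with the last kept item is enough.
    """
    unique = []
    for text in sorted(texts):
        if not unique or text != unique[-1]:
            unique.append(text)
    return unique


if __name__ == "__main__":
    print(drop_exact_duplicates(["b", "a", "b", "a", "c"]))
```

With pandas itself, `df.drop_duplicates(subset="text")` should give the same effect without sorting by hand, assuming the column is named `text` as in the sample above.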
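If you want to remove sentences that are very similar but not identical, the problem is much harder and there is no simple solution. You will need to look into locality-sensitive hashing or near-duplicate detection. The datasketch library may help.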
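To make "very similar but not identical" concrete, here is a rough sketch that scores pairs of texts by Jaccard similarity over word trigrams. It is a brute-force stand-in, not locality-sensitive hashing: it compares every pair, so it is only practical for small samples. All function names and the threshold are assumptions chosen for illustration.

```python
def shingles(text, k=3):
    """Return the set of k-word shingles (word k-grams) in `text`."""
    words = text.lower().split()
    if len(words) < k:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; 0.0 if either set is empty."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def near_duplicate_pairs(texts, threshold=0.8, k=3):
    """Return index pairs (i, j), i < j, whose similarity >= threshold.

    Quadratic in len(texts): fine for a sample, too slow for 100,000 rows.
    """
    sets = [shingles(t, k) for t in texts]
    pairs = []
    for i in range(len(sets)):
        for j in range(i + 1, len(sets)):
            if jaccard(sets[i], sets[j]) >= threshold:
                pairs.append((i, j))
    return pairs
```

For the full dataset, the same similarity measure can be approximated at scale with MinHash signatures and an LSH index, which is what datasketch's `MinHash` and `MinHashLSH` classes are designed for.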
Based on your comment, I think I finally understand: you want to remove a common prefix. In that case, modify the code above as follows:
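One way the prefix-stripping idea could look, as a sketch rather than the answer's own code: sort the texts, find the longest prefix each text shares with its sorted neighbours, and cut it off when it is long enough to be a repeated lead. The function name `strip_shared_prefixes` and the `min_len` cutoff are assumptions.

```python
import os


def strip_shared_prefixes(texts, min_len=20):
    """Strip the prefix each text shares with a sorted neighbour.

    Hypothetical helper. `min_len` (an assumed cutoff) keeps short,
    accidental overlaps such as "The " from being removed.
    The result keeps the original order of `texts`.
    """
    order = sorted(range(len(texts)), key=lambda i: texts[i])
    result = list(texts)
    for pos, idx in enumerate(order):
        best = 0
        # Texts with the same lead end up adjacent after sorting,
        # so only the previous and next neighbours need checking.
        for neighbour_pos in (pos - 1, pos + 1):
            if 0 <= neighbour_pos < len(order):
                other = texts[order[neighbour_pos]]
                shared = os.path.commonprefix([texts[idx], other])
                best = max(best, len(shared))
        if best >= min_len:
            result[idx] = texts[idx][best:]
    return result
```

On a DataFrame this could be applied as `df["text"] = strip_shared_prefixes(df["text"].tolist())`, again assuming the column name. Note that the cut is character-based, so if two bodies happen to begin with the same letters, those letters are removed along with the lead, and rows that are fully identical become empty strings.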
I think you can use difflib, for example:
Output:
This means the strings match starting from cd3.
Now if you do this:
the output will be:
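If you instead do c = s.find_longest_match(0, len(a), 0, len(b)), then c[0] = 0, c[1] = 0, and c[2] = 28; that is, the longest match starts at index 0 in both strings and is 28 characters long.
You can read more here: https://docs.python.org/2/library/difflib.html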
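A small sketch of the difflib approach, using made-up strings (the strings `a` and `b` from the answer are not shown here, so these are assumptions, as is the helper name `shared_lead`). `SequenceMatcher.find_longest_match` returns a `Match(a, b, size)` named tuple; when both start positions are 0, the first `size` characters are the shared lead.

```python
from difflib import SequenceMatcher


def shared_lead(a, b):
    """Return the common lead of `a` and `b` if their longest match
    starts at index 0 in both strings, else an empty string.

    Hypothetical helper; autojunk=False avoids difflib's heuristic
    that ignores very frequent characters in long sequences.
    """
    s = SequenceMatcher(None, a, b, autojunk=False)
    c = s.find_longest_match(0, len(a), 0, len(b))
    if c.a == 0 and c.b == 0:
        return a[:c.size]
    return ""


if __name__ == "__main__":
    lead = "for those who werent as productive as they would have liked, "
    a = lead + "here is tip one"
    b = lead + "another body"
    prefix = shared_lead(a, b)
    # Drop the shared lead from both strings.
    print(a[len(prefix):])
    print(b[len(prefix):])
```

One caveat with this design: if the two bodies share a substring longer than the lead, the longest match will not start at index 0 and the helper returns an empty string, so it works best when the lead is the dominant overlap.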