Deduplication
Remove duplicate documents using popular algorithms such as simhash, spotsig, and shingling.
Installation
Run the following commands:
# install current library
pip install deduplication
# install required pretrained NLP models
python -m spacy download xx_ent_wiki_sm
python -m spacy download en_core_web_sm
Examples
simhash
from deduplication import simhash
hashvalue1 = simhash('this is text')
hashvalue2 = simhash('this is another text', n_block=4)
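To illustrate the idea behind simhash, here is a minimal, self-contained sketch in plain Python. It is not this library's implementation: the function names `simhash_sketch` and `hamming_distance`, the whitespace tokenization, the MD5 token hash, and the 64-bit width are all assumptions chosen for illustration. Two texts are usually treated as near-duplicates when the Hamming distance between their fingerprints is small.

```python
import hashlib

BITS = 64  # fingerprint width, an assumption made for this sketch


def simhash_sketch(text):
    """Illustrative 64-bit simhash over whitespace tokens (hypothetical helper)."""
    weights = [0] * BITS
    for token in text.lower().split():
        # Hash each token to a stable 64-bit integer (first 8 bytes of MD5).
        digest = hashlib.md5(token.encode("utf-8")).digest()
        h = int.from_bytes(digest[:BITS // 8], "big")
        # Each bit of the token hash votes +1 or -1 on that bit position.
        for i in range(BITS):
            weights[i] += 1 if (h >> i) & 1 else -1
    # A fingerprint bit is set where the positive votes win.
    fingerprint = 0
    for i, w in enumerate(weights):
        if w > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming_distance(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


fp1 = simhash_sketch("this is text")
fp2 = simhash_sketch("this is another text")
print(hamming_distance(fp1, fp2))
```

The threshold below which two fingerprints count as duplicates depends on the corpus and has to be tuned; this sketch does not prescribe one.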
l-simhash
from deduplication import lsimhash
hashvalue = lsimhash('this is very long article texts. maybe with a lot of sentences.')
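The project description also names shingling, which has no usage example here. The following is a minimal sketch of word-level shingling with Jaccard similarity, written in plain Python. It does not use this library's API: `shingles`, `jaccard`, and the default shingle size `k=3` are hypothetical names and values chosen for illustration.

```python
def shingles(text, k=3):
    """Return the set of k-word shingles of a text (hypothetical helper)."""
    tokens = text.lower().split()
    if len(tokens) < k:
        # Texts shorter than k words yield a single short shingle, or none.
        return {tuple(tokens)} if tokens else set()
    return {tuple(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two shingle sets: |a & b| / |a | b|."""
    if not a and not b:
        return 1.0  # two empty documents are treated as identical
    return len(a & b) / len(a | b)


a = shingles("the quick brown fox jumps")
b = shingles("the quick brown fox leaps")
print(jaccard(a, b))  # prints 0.5 (2 shared shingles out of 4 distinct)
```

Documents whose similarity exceeds a chosen threshold would be treated as duplicates; the threshold is application-specific.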
Citations
simhash
Sadowski C, Levin G. Simhash: Hash-based similarity detection. Technical report, Google, 2007.