For example, this is the kind of input I have in mind:
input = [
'<html><head><title>Albert Einstein - Minipedia</title></head><body><b>Welcome to Minipedia! You are viewing page 1</b> Albert Einstein was a scientist</body></html>',
'<html><head><title>Ludwig Van Beethoven - Minipedia</title></head><body><b>Welcome to Minipedia! You are viewing page 2</b> Ludwig van Beethoven was a Musician</body></html>',
'<html><head><title>Red - Minipedia</title></head><body><b>Welcome to Minipedia! You are viewing page 3</b> Red is a color.</body></html>'
]
And this is the output I want:
output = [
['Albert Einstein', 'Albert Einstein was a scientist'],
['Ludwig Van Beethoven', 'Ludwig van Beethoven was a Musician'],
['Red', 'Red is a color.']
]
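The logic I am looking for is roughly this: if substrings of every document overlap substantially (that is, they are within a small enough edit distance of each other), they should be pulled out as shared template text and used to delimit the remaining strings, which differ enough from one document to the next.
Is there a library for this?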
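To make the idea concrete, here is a minimal sketch using only `difflib` from the Python standard library. It is a simplification under stated assumptions, not a finished solution: it looks for exactly matching blocks rather than blocks within a small edit distance, and it treats a character as template text only if it falls inside a block of at least `min_len` characters shared with every other document. The names `shared_mask`, `variable_chunks`, `min_len`, and `pages` are made up for this illustration.

```python
from difflib import SequenceMatcher


def shared_mask(doc, other, min_len):
    """Flag each character of `doc` that lies in a block shared with `other`."""
    mask = [False] * len(doc)
    matcher = SequenceMatcher(None, doc, other, autojunk=False)
    for block in matcher.get_matching_blocks():
        # Ignore short, probably coincidental matches (min_len is an assumed knob).
        if block.size >= min_len:
            for i in range(block.a, block.a + block.size):
                mask[i] = True
    return mask


def variable_chunks(docs, min_len=5):
    """For each document, return the text runs that are not shared template."""
    results = []
    for idx, doc in enumerate(docs):
        others = [d for j, d in enumerate(docs) if j != idx]
        masks = [shared_mask(doc, other, min_len) for other in others]
        chunks, current = [], []
        for pos, ch in enumerate(doc):
            # Template text must be shared with *every* other document.
            if masks and all(m[pos] for m in masks):
                if current:
                    chunks.append("".join(current))
                    current = []
            else:
                current.append(ch)
        if current:
            chunks.append("".join(current))
        # Trim whitespace and drop runs that were only whitespace.
        results.append([c.strip() for c in chunks if c.strip()])
    return results


pages = [
    '<html><head><title>Albert Einstein - Minipedia</title></head><body><b>Welcome to Minipedia! You are viewing page 1</b> Albert Einstein was a scientist</body></html>',
    '<html><head><title>Ludwig Van Beethoven - Minipedia</title></head><body><b>Welcome to Minipedia! You are viewing page 2</b> Ludwig van Beethoven was a Musician</body></html>',
    '<html><head><title>Red - Minipedia</title></head><body><b>Welcome to Minipedia! You are viewing page 3</b> Red is a color.</body></html>',
]

result = variable_chunks(pages)
for chunks in result:
    print(chunks)
```

On these three pages the sketch should keep the title and the body sentence of each page, but it also keeps the page number, since that differs between pages too. Requiring a match against every other document is what stops text like " was a ", which only two of the pages share, from being stripped as template. Tolerating small differences inside the template would need an approximate matcher in place of `get_matching_blocks`.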