How can I use Levenshtein distance to find where sentences similar to a given sentence begin in a text file?

Posted 2024-04-27 03:43:01



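I need to find where each of my sentences begins in a text file, but the sentences as they appear in the file may differ slightly from the ones I have in my array.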

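I want to use Levenshtein distance to compare the sentences, but what should I compare against? The file is very large, and each sentence spans at most one line.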
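For reference, Levenshtein distance counts the minimum number of single-character insertions, deletions, and substitutions needed to turn one string into another. Below is a minimal pure-Python sketch of it, plus a length-normalized score. The names `levenshtein` and `similarity` are hypothetical and do not come from the code in this post. Since each sentence fits on one line, a single line of the file (or its beginning) is one plausible unit to compare a title against.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character edits that turn a into b."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop to save memory
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to the empty prefix of b
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute (or keep) the character
            ))
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Scale the distance to [0, 1]; 1.0 means the strings are identical."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


print(levenshtein("kitten", "sitting"))  # 3
```

A title could then be scored against each line of the file, for example with `similarity(title, line[:len(title)])`, keeping the lines whose score exceeds a chosen threshold.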

Here is my code so far. It does a plain comparison with no similarity distance:

import re
import pandas as pd

data = pd.read_excel("./excel_file_with_the_sentences.xlsx")
df = pd.DataFrame(data, columns=['Année', 'Journal', 'A_Sommaire', 'Numero'])
# print(df)

# Filter and sort in one chain so the sort does not act in place on a filtered slice
jo = df.query("Année == 2018").sort_values(by=['Numero'])
# "A_Sommaire" contains the sentences; the other fields are only there to filter and sort
print(jo["A_Sommaire"])
print(len(jo))
#################################################################################

file_path = "./the_file_with_the_text.txt"

# The context manager closes the file once it has been read
with open(file_path) as file:
    txt = file.read()
##################################################################################

titles = jo["A_Sommaire"].tolist()
print(titles)
beginnings = []
for title in titles:
    # finditer yields every exact occurrence of the title in the text;
    # I want to change this so that it also finds "similar" titles or sentences.
    # re.escape keeps punctuation in a title from being read as regex syntax.
    beginning = re.finditer(re.escape(title), txt, flags=re.MULTILINE)
    beginnings.append([b.start() for b in beginning])

print(beginnings)

The result is:

[[], [], [], [], [], [13898], [], [17136], [17645], [18743], [19886], [21010], [22165], [], [], [], [26885], [], [31049], [33333], [35260], [37339], [39760], [41822], [], [45880], [], [], [], [54839], []]

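This is incomplete: normally there should be no empty lists, because every sentence in the Excel file should appear at least once in the text file.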

So my question is: how can I use Levenshtein distance, or any other similarity measure, to find where all of my sentences begin in the text file?

Note that the files are too large for me to extract even a portion as an example, and I apologize for that.
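One possible approach, sketched under stated assumptions: because each sentence is at most one line long, every title can be compared with the start of every line, and the character offset of each sufficiently similar line recorded. The sketch below uses `difflib.SequenceMatcher` from the standard library, which computes a ratio-based similarity and is not Levenshtein distance. The names `ratio` and `find_similar_starts`, the 0.85 threshold, and the sample strings are all hypothetical. Any other function that returns a score in [0, 1] can be passed as `similarity`.

```python
from difflib import SequenceMatcher


def ratio(a: str, b: str) -> float:
    """Similarity in [0, 1] from the standard library (not Levenshtein)."""
    return SequenceMatcher(None, a, b).ratio()


def find_similar_starts(titles, text, threshold=0.85, similarity=ratio):
    """For each title, list the offsets of lines whose start resembles it."""
    # Split the text once, remembering where each line starts in `text`.
    lines = []
    offset = 0
    for raw in text.splitlines(keepends=True):
        lines.append((offset, raw.rstrip("\r\n")))
        offset += len(raw)

    results = []
    for title in titles:
        title = title.strip()
        hits = []
        if title:  # an empty title would trivially match every line
            for start, line in lines:
                # Compare against a prefix of the same length as the title,
                # so a longer line that starts with the title can still match.
                candidate = line[:len(title)]
                if similarity(title.lower(), candidate.lower()) >= threshold:
                    hits.append(start)
        results.append(hits)
    return results


# Hypothetical sample data: the title lacks the accents found in the text.
sample_text = "Intro\nArrêté du 3 janvier 2018 relatif aux tarifs\nAutre ligne\n"
sample_titles = ["Arrete du 3 janvier 2018 relatif aux tarifs"]
print(find_similar_starts(sample_titles, sample_text))  # [[6]]
```

With the variables from the code in this post, the call would be `find_similar_starts(titles, txt)`. Each title is compared with each line, so the run time grows with the number of titles times the number of lines, and the threshold would likely need tuning on the real data.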


0 answers

No answers yet.
