Strange results from fuzzywuzzy's process

Posted 2024-10-03 17:27:48


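I have a list of 22 school names and want to search it for "St. Paul" with process.extractBests. The result is surprising: four entries in the list start with "St Paul", yet the search returns none of them and returns other entries instead.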

>>> schList =["Diocesan Boy's School", "Diocesan Girl's School", 'Heep Yunn School', 'La Salle College', 'Maryknoll Convent School', 'Marymount Secondary School', 'Methodist College', 'Sacred Heart Canossian College', "St Clare's Girl's School", 'St Francis Canossian College', "St Joseph's College", "St Mark's School", "St Mary's Canossian College", "St Paul's Co-educational College", "St Paul's College", "St Paul's Convent School", "St Paul's Secondary School", "St Stephen's Girl's College", 'Wah Yan College, Hong Kong', 'Wah Yan College, Kowloon', 'Ying Wa College', "Ying Wa Girl's School"]
>>> ans= process.extractBests("St. Paul",schList)
>>> ans
[("St Clare's Girl's School", 86), ('St Francis Canossian College', 86), ("St Joseph's College", 86), ("St Mark's School", 86), ("St Mary's Canossian College", 86)]

Is some kind of preprocessing needed to get better, or at least more reasonable, results?


1 answer
Answer 1 · posted 2024-10-03 17:27:48

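I think a good place to start is what fuzzywuzzy actually does. It uses Levenshtein distance to calculate the differences between sequences. If you look up what that is, you will find something like: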

the Levenshtein distance is a string metric for measuring the difference between two sequences. Informally, the Levenshtein distance between two words is the minimum number of single-character edits (insertions, deletions or substitutions) required to change one word into the other.

In other words, the results you get are the strings that need the fewest edits to look like St Paul.
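To make the quoted definition concrete, here is a minimal pure-Python sketch of the edit distance. The function name `levenshtein` is made up for this illustration, and this is not fuzzywuzzy's own implementation: the library reports normalized scores from 0 to 100 rather than raw edit counts.

```python
def levenshtein(a, b):
    """Return the minimum number of single-character edits turning a into b."""
    # Keep the shorter string in the inner loop so the rows stay small.
    if len(a) < len(b):
        a, b = b, a
    # previous[j] is the distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]  # turning a[:i] into the empty string takes i deletions
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution (or match when cost is 0)
            ))
        previous = current
    return previous[-1]


print(levenshtein("St. Paul", "St Paul"))
```

The two spellings of the query are one edit apart (the dot), which is a small difference in raw distance but can still change the ranking once scores are normalized.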

Also, you say you are looking for St Paul, but your code searches for St. Paul (with a dot). That makes a difference.
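Since the question asks about preprocessing, one possible approach is to normalize both the query and the school names before comparing them, so that a stray dot or apostrophe no longer matters. The sketch below uses only the standard library. The helper `normalize` and the short `names` list are assumptions made for this example, not part of fuzzywuzzy.

```python
import string

# Translation table that deletes every ASCII punctuation character.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def normalize(text):
    """Lowercase the text, drop ASCII punctuation and collapse whitespace."""
    return " ".join(text.translate(_PUNCT_TABLE).lower().split())


names = ["St Paul's College", "St Mark's School", "St Paul's Convent School"]
query = normalize("St. Paul")  # the dot is removed before matching

# Compare normalized forms, but report the original spellings.
matches = [name for name in names if query in normalize(name)]
print(matches)
```

The same normalized query could also be passed to process.extractBests in place of the raw string. Whether that improves the fuzzy scores depends on the scorer in use, so it is worth checking against your own data.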

For example:

from fuzzywuzzy import process

schools = ["Diocesan Boy's School", "Diocesan Girl's School", 'Heep Yunn School', 'La Salle College', 'Maryknoll Convent School', 'Marymount Secondary School', 'Methodist College', 'Sacred Heart Canossian College', "St Clare's Girl's School", 'St Francis Canossian College', "St Joseph's College", "St Mark's School", "St Mary's Canossian College", "St Paul's Co-educational College", "St Paul's College", "St Paul's Convent School", "St Paul's Secondary School", "St Stephen's Girl's College", 'Wah Yan College, Hong Kong', 'Wah Yan College, Kowloon', 'Ying Wa College', "Ying Wa Girl's School"]
print(process.extractBests("St Paul", schools))

gives you this:

[("St Paul's Co-educational College", 90), ("St Paul's College", 90), ("St Paul's Convent School", 90), ("St Paul's Secondary School", 90), ("St Clare's Girl's School", 86)]

On the other hand, if all you want is to find the strings that contain another string, why not do this:

schools = ["Diocesan Boy's School", "Diocesan Girl's School", 'Heep Yunn School', 'La Salle College', 'Maryknoll Convent School', 'Marymount Secondary School', 'Methodist College', 'Sacred Heart Canossian College', "St Clare's Girl's School", 'St Francis Canossian College', "St Joseph's College", "St Mark's School", "St Mary's Canossian College", "St Paul's Co-educational College", "St Paul's College", "St Paul's Convent School", "St Paul's Secondary School", "St Stephen's Girl's College", 'Wah Yan College, Hong Kong', 'Wah Yan College, Kowloon', 'Ying Wa College', "Ying Wa Girl's School"]
print([s for s in schools if "St Paul" in s])

Output: ["St Paul's Co-educational College", "St Paul's College", "St Paul's Convent School", "St Paul's Secondary School"]
