Strange results from fuzzywuzzy's process

>>> from fuzzywuzzy import process
>>> schList = ["Diocesan Boy's School", "Diocesan Girl's School", 'Heep Yunn School', 'La Salle College', 'Maryknoll Convent School', 'Marymount Secondary School', 'Methodist College', 'Sacred Heart Canossian College', "St Clare's Girl's School", 'St Francis Canossian College', "St Joseph's College", "St Mark's School", "St Mary's Canossian College", "St Paul's Co-educational College", "St Paul's College", "St Paul's Convent School", "St Paul's Secondary School", "St Stephen's Girl's College", 'Wah Yan College, Hong Kong', 'Wah Yan College, Kowloon', 'Ying Wa College', "Ying Wa Girl's School"]
>>> ans = process.extractBests("St. Paul", schList)
>>> ans
[("St Clare's Girl's School", 86), ('St Francis Canossian College', 86), ("St Joseph's College", 86), ("St Mark's School", 86), ("St Mary's Canossian College", 86)]

1 Answer

User

#1 · Posted 2024-10-03 17:27:48

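I think a good place to start is what fuzzywuzzy actually does. It uses Levenshtein distance to calculate the differences between sequences. If you look up what that is, you will find something like: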

The Levenshtein distance is a string metric for measuring the difference between two sequences. Informally, the Levenshtein distance between two words is the minimum number of single-character edits (insertions, deletions or substitutions) required to change one word into the other.

In other words, the results you got are the strings that take the least editing to look like your search string.
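To make the quoted definition concrete, here is a minimal pure-Python sketch of Levenshtein distance. The function name `levenshtein` is my own and is not part of fuzzywuzzy. Note also that fuzzywuzzy reports similarity scores from 0 to 100 rather than raw edit counts, so this only illustrates the underlying metric, not the library's scoring.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    or substitutions needed to turn a into b."""
    # Keep the shorter string in b so each row stays small.
    if len(a) < len(b):
        a, b = b, a
    # previous[j] is the distance between the empty prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance between a[:i] and the empty string
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute (or keep if equal)
            ))
        previous = current
    return previous[-1]


print(levenshtein("kitten", "sitting"))    # 3
print(levenshtein("St. Paul", "St Paul"))  # 1
```

The second call shows that the two spellings of the query differ by exactly one edit, the period.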

Also, you say you are looking for St Paul, but in your code you have St. Paul, with a period. That makes a difference.

For example:

from fuzzywuzzy import process

schools = ["Diocesan Boy's School", "Diocesan Girl's School", 'Heep Yunn School', 'La Salle College', 'Maryknoll Convent School', 'Marymount Secondary School', 'Methodist College', 'Sacred Heart Canossian College', "St Clare's Girl's School", 'St Francis Canossian College', "St Joseph's College", "St Mark's School", "St Mary's Canossian College", "St Paul's Co-educational College", "St Paul's College", "St Paul's Convent School", "St Paul's Secondary School", "St Stephen's Girl's College", 'Wah Yan College, Hong Kong', 'Wah Yan College, Kowloon', 'Ying Wa College', "Ying Wa Girl's School"]
print(process.extractBests("St Paul", schools))

gives you this:

[("St Paul's Co-educational College", 90), ("St Paul's College", 90), ("St Paul's Convent School", 90), ("St Paul's Secondary School", 90), ("St Clare's Girl's School", 86)]

On the other hand, if all you want is to find the strings that contain another string, why not simply do this:

schools = ["Diocesan Boy's School", "Diocesan Girl's School", 'Heep Yunn School', 'La Salle College', 'Maryknoll Convent School', 'Marymount Secondary School', 'Methodist College', 'Sacred Heart Canossian College', "St Clare's Girl's School", 'St Francis Canossian College', "St Joseph's College", "St Mark's School", "St Mary's Canossian College", "St Paul's Co-educational College", "St Paul's College", "St Paul's Convent School", "St Paul's Secondary School", "St Stephen's Girl's College", 'Wah Yan College, Hong Kong', 'Wah Yan College, Kowloon', 'Ying Wa College', "Ying Wa Girl's School"]
print([s for s in schools if "St Paul" in s])

Output: ["St Paul's Co-educational College", "St Paul's College", "St Paul's Convent School", "St Paul's Secondary School"]
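A plain `in` test is exact, so the query "St. Paul" with its period would match nothing in this list. One possible workaround, sketched below, is to normalize both sides before comparing. The helpers `normalize` and `contains` are hypothetical names of my own, and the rule of dropping ASCII punctuation and lowercasing is an assumption about what counts as "the same" name. The list is shortened here for brevity.

```python
import string


def normalize(text: str) -> str:
    """Lowercase, drop ASCII punctuation, and collapse runs of whitespace."""
    stripped = text.translate(str.maketrans("", "", string.punctuation))
    return " ".join(stripped.lower().split())


def contains(query: str, candidates: list[str]) -> list[str]:
    """Return the candidates whose normalized form contains the normalized query."""
    needle = normalize(query)
    return [c for c in candidates if needle in normalize(c)]


# Shortened sample of the school list from the question.
schools = [
    "St Clare's Girl's School",
    "St Paul's College",
    "St Paul's Convent School",
    "Ying Wa College",
]
print(contains("St. Paul", schools))
# ["St Paul's College", "St Paul's Convent School"]
```

The original strings are returned unchanged; only the comparison uses the normalized form.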
