I have some data that contains spelling errors. I'm correcting them and scoring how close each spelling is, using the following code:
import pandas as pd
import difflib
Li_A = ["potato", "tomato", "squash", "apple", "pear"]
Q = {'one' : pd.Series(["potat0", "toma3o", "s5uash", "ap8le", "pea7"], index=['a', 'b', 'c', 'd', 'e']),
     'two' : pd.Series(["po1ato", "2omato", "squ0sh", "2pple", "p3ar"], index=['a', 'b', 'c', 'd', 'e'])}
df_Q = pd.DataFrame(Q)
# Define the function that Corrects & Scores the Spelling
def Spelling(ask):
    a = difflib.get_close_matches(ask, Li_A, n=5, cutoff=0.1)
    # List comprehension for all values of a
    b = [difflib.SequenceMatcher(None, ask, x).ratio() for x in a]
    return pd.Series(a + b)
# Apply the function that Corrects & Scores the Spelling
df_A = df_Q['one'].apply(Spelling)
# Get the column names on the A dataframe
c = len(df_A.columns) // 2
df_A.columns = ['Spelling_{}'.format(x) for x in range(c)] + \
               ['Score_{}'.format(y) for y in range(c)]
# Join the Q & A dataframes
df_QA = df_Q.join(df_A)
The result is as follows:
df_QA
one two Spelling_0 Spelling_1 Spelling_2 Spelling_3 Spelling_4 \
a potat0 po1ato potato tomato pear apple squash
b toma3o 2omato tomato potato pear apple squash
c s5uash squ0sh squash pear apple tomato potato
d ap8le 2pple apple pear tomato squash potato
e pea7 p3ar pear potato apple tomato squash
Score_0 Score_1 Score_2 Score_3 Score_4
a 0.833333 0.500000 0.400000 0.181818 0.166667
b 0.833333 0.333333 0.200000 0.181818 0.166667
c 0.833333 0.200000 0.181818 0.166667 0.166667
d 0.800000 0.222222 0.181818 0.181818 0.181818
e 0.750000 0.400000 0.444444 0.200000 0.200000
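For row 'e', "potato" is in column 1 and "apple" is in column 2. However, "apple" has a higher score than "potato" (0.444444 vs. 0.400000). This is wrong for my application.
How can I get the higher-scoring results to always be further to the left?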
Edit 1: I tried a simpler piece of code:
import difflib
Li_A = ["potato", "tomato", "squash", "apple", "pear"]
Q = "pea7"
A = difflib.get_close_matches(Q, Li_A, n=5, cutoff=0.1)
and got the same result:
A: ['pear', 'potato', 'apple', 'tomato', 'squash']
I also tried a simpler piece of scoring code:
import difflib
S1 = difflib.SequenceMatcher(None, "pea7", "potato")
R1 = S1.ratio()
S2 = difflib.SequenceMatcher(None, "pea7", "apple")
R2 = S2.ratio()
Again, I got the same result:
R1: 0.4
R2: 0.444
Edit 2: I tried it with FuzzyWuzzy. Since FuzzyWuzzy relies on difflib, I got the same result again:
from fuzzywuzzy import fuzz
R1 = fuzz.ratio("pea7", "potato")
R2 = fuzz.ratio("pea7", "apple")
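SequenceMatcher computes the ratio correctly, using the method described by Ratcliff and Metzener (1988). That is, for the number of common characters (CC) and the total number of characters in both strings (CT):
ratio = 2 * CC / CT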
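A minimal sketch of this calculation, taking the ratio as 2 * CC / CT (the difflib documentation gives it as 2.0*M/T). It recomputes the two scores from the question by summing the sizes of the matching blocks that SequenceMatcher finds. The helper name ratio_by_hand is made up for illustration; get_matching_blocks() and ratio() are standard difflib calls.

```python
import difflib

def ratio_by_hand(a, b):
    """Recompute SequenceMatcher's ratio as 2 * CC / CT (hypothetical helper)."""
    sm = difflib.SequenceMatcher(None, a, b)
    # CC: characters covered by the matching blocks
    # (the list ends with a dummy block of size 0, which adds nothing)
    cc = sum(block.size for block in sm.get_matching_blocks())
    # CT: total number of characters in both strings
    ct = len(a) + len(b)
    # difflib returns 1.0 when both sequences are empty
    return 2 * cc / ct if ct else 1.0

for candidate in ["potato", "apple"]:
    by_hand = ratio_by_hand("pea7", candidate)
    builtin = difflib.SequenceMatcher(None, "pea7", candidate).ratio()
    print(candidate, round(by_hand, 3), round(builtin, 3))
```

With "pea7" as the first string, both candidates share two characters with it, so the shorter "apple" (CT = 9) scores 0.444 and "potato" (CT = 10) scores 0.4, in agreement with R1 and R2 in the question.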
So the problem appears to lie in the "close matches" step, i.e. in get_close_matches.
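One possible explanation, based on my reading of CPython's difflib source and not on anything stated in the question: get_close_matches scores each candidate with the candidate as the first sequence and the query as the second, which is the reverse of the argument order used in the Spelling function, and ratio() can give different values depending on argument order. The sketch below shows the two orders disagreeing for "pea7" and "apple", then ranks the candidates by the same score that gets reported, so the order and the scores cannot drift apart. The names score and rank_matches are hypothetical helpers, not part of difflib.

```python
import difflib

Li_A = ["potato", "tomato", "squash", "apple", "pear"]

def score(ask, candidate):
    # Same argument order as the scoring code in the question
    return difflib.SequenceMatcher(None, ask, candidate).ratio()

def rank_matches(ask, choices, n=5, cutoff=0.1):
    """Return (candidate, score) pairs sorted by score, best first (hypothetical helper)."""
    scored = [(c, score(ask, c)) for c in choices]
    kept = [pair for pair in scored if pair[1] >= cutoff]
    # sorted() is stable, so tied candidates keep their order from `choices`
    return sorted(kept, key=lambda pair: pair[1], reverse=True)[:n]

# The ratio depends on argument order for this pair
print(score("pea7", "apple"))   # query first, as in the question's scoring code
print(score("apple", "pea7"))   # candidate first

for name, s in rank_matches("pea7", Li_A):
    print(name, round(s, 3))
```

Inside Spelling, the get_close_matches call could then be swapped for rank_matches(ask, Li_A), with the names and scores unpacked from the returned pairs. For "pea7" this puts "apple" ahead of "potato".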