Multiple spelling results in a DataFrame

Posted 2024-06-17 12:43:28


I have some data that contains spelling errors. I am correcting them and scoring how close each spelling is, using the following code:

 import pandas as pd
 import difflib

 Li_A = ["potato", "tomato", "squash", "apple", "pear"]

 Q    = {'one' : pd.Series(["potat0", "toma3o", "s5uash", "ap8le", "pea7"], index=['a', 'b', 'c', 'd', 'e']),
         'two' : pd.Series(["po1ato", "2omato", "squ0sh", "2pple", "p3ar"], index=['a', 'b', 'c', 'd', 'e'])}

 df_Q = pd.DataFrame(Q)

 # Define the function that Corrects & Scores the Spelling
 def Spelling(ask):
     a = difflib.get_close_matches(ask, Li_A, n=5, cutoff=0.1)

     # List comprehension for all values of a
     b = [difflib.SequenceMatcher(None, ask, x).ratio() for x in a]
     return pd.Series(a + b)

 # Apply the function that Corrects & Scores the Spelling
 df_A = df_Q['one'].apply(Spelling)

 # Get the column names on the A dataframe
 c = len(df_A.columns) // 2
 df_A.columns = ['Spelling_{}'.format(x) for x in range(c)] + \
                ['Score_{}'.format(y)    for y in range(c)]

 # Join the Q & A dataframes
 df_QA = df_Q.join(df_A)

The result is as follows:

 df_QA
       one     two Spelling_0 Spelling_1 Spelling_2 Spelling_3 Spelling_4  \
 a  potat0  po1ato     potato     tomato       pear      apple     squash   
 b  toma3o  2omato     tomato     potato       pear      apple     squash   
 c  s5uash  squ0sh     squash       pear      apple     tomato     potato   
 d   ap8le   2pple      apple       pear     tomato     squash     potato   
 e    pea7    p3ar       pear     potato      apple     tomato     squash   

     Score_0   Score_1   Score_2   Score_3   Score_4  
 a  0.833333  0.500000  0.400000  0.181818  0.166667  
 b  0.833333  0.333333  0.200000  0.181818  0.166667  
 c  0.833333  0.200000  0.181818  0.166667  0.166667  
 d  0.800000  0.222222  0.181818  0.181818  0.181818  
 e  0.750000  0.400000  0.444444  0.200000  0.200000  

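For row 'e', "potato" is in column 1 and "apple" is in column 2, yet apple scores higher than potato (0.444 versus 0.400). That is wrong for my application.

How can I get the higher-scoring results to always appear further to the left?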

Edit 1: I tried a simpler piece of code:

 import difflib
 Li_A = ["potato", "tomato", "squash", "apple", "pear"]
 Q    = "pea7"
 A = difflib.get_close_matches(Q, Li_A, n=5, cutoff=0.1)

It gives the same result:

 A: ['pear', 'potato', 'apple', 'tomato', 'squash']

I also tried a simpler piece of scoring code:

 import difflib
 S1 = difflib.SequenceMatcher(None, "pea7", "potato")
 R1 = S1.ratio()
 S2 = difflib.SequenceMatcher(None, "pea7", "apple")
 R2 = S2.ratio()

Again, it gives the same result:

 R1: 0.4
 R2: 0.444

Edit 2: I also tried fuzzywuzzy. Because fuzzywuzzy relies on difflib, I got the same result again:

 from fuzzywuzzy import fuzz
 R1 = fuzz.ratio("pea7", "potato")
 R2 = fuzz.ratio("pea7", "apple")

1 answer

Answer 1 · posted 2024-06-17 12:43:28

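SequenceMatcher computes the ratio correctly, using the method described by Ratcliff and Metzener (1988). That is, for the number of characters in common (CC) and the total number of characters in the two strings (CT):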

ratio = 2 * CC / CT
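As a rough check of that formula, the sketch below recomputes the ratio by hand from the matching blocks that SequenceMatcher reports. The helper name `ratio_by_hand` is made up for this illustration and is not part of difflib.

```python
import difflib

def ratio_by_hand(a, b):
    """Recompute 2 * CC / CT from SequenceMatcher's matching blocks (illustrative helper)."""
    matcher = difflib.SequenceMatcher(None, a, b)
    # CC: total size of all matching blocks (the final dummy block has size 0)
    cc = sum(block.size for block in matcher.get_matching_blocks())
    # CT: total number of characters in both strings
    ct = len(a) + len(b)
    # Two empty strings are treated as identical, as difflib does
    return 2 * cc / ct if ct else 1.0

for word in ["potato", "apple"]:
    by_hand = ratio_by_hand("pea7", word)
    builtin = difflib.SequenceMatcher(None, "pea7", word).ratio()
    print(word, by_hand, builtin)
```

For "pea7" against "potato" there are 2 matched characters out of 10 in total, and for "pea7" against "apple" there are 2 out of 9, which agrees with the 0.4 and 0.444 reported in the question.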

So the problem appears to lie in the "close" matching step, get_close_matches.
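One possible workaround, offered as a sketch rather than a definitive fix, is to skip get_close_matches and sort the candidates by the same score that ends up in the Score columns. A likely cause of the mismatch, going by the CPython implementation of difflib, is that get_close_matches compares the candidate and the query in the opposite argument order, and SequenceMatcher.ratio() can change when the arguments are swapped. The helper name `ranked_matches` is hypothetical.

```python
import difflib

Li_A = ["potato", "tomato", "squash", "apple", "pear"]

def ranked_matches(ask, choices, n=5, cutoff=0.1):
    """Return up to n (word, score) pairs, highest score first (illustrative helper)."""
    # Score every candidate with the same call used for the Score columns
    scored = [(x, difflib.SequenceMatcher(None, ask, x).ratio()) for x in choices]
    # Drop candidates below the cutoff, then sort by score, best first
    scored = [(x, s) for x, s in scored if s >= cutoff]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:n]

ranked = ranked_matches("pea7", Li_A)
words = [w for w, _ in ranked]
scores = [s for _, s in ranked]
print(words)
print(scores)

# The score depends on argument order, which may explain the original ordering
forward = difflib.SequenceMatcher(None, "pea7", "apple").ratio()
backward = difflib.SequenceMatcher(None, "apple", "pea7").ratio()
print(forward, backward)
```

Under this approach, the Spelling function from the question could return pd.Series(words + scores), so that each Spelling_k column lines up with its Score_k column and the scores decrease from left to right.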
