Is there a more efficient way to write this in Python, to reduce the runtime of if-elif checks against thresholds generated with product(range())?

Posted 2024-10-02 12:22:52

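I compute an accuracy score by enumerating every possible combination of thresholds, using each combination to return a matched name, and then scoring each combination to see which one gives the highest accuracy. To do this I build the combinations with product(range()) and apply them with if-elif statements, but this takes a very long time (so far, more than an hour running on 1,300 rows). Is there a better way?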
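For a sense of scale, here is a quick count of how many threshold combinations the loop in this post generates, assuming the grid range(50, 110, 5) and the nine thresholds used in the code further down. The variable names are only illustrative.

```python
# Count the threshold combinations produced by product(range(50, 110, 5), repeat=9).
grid = range(50, 110, 5)      # 50, 55, ..., 105
n_thresholds = 9              # one threshold per scorer in f()

n_values = len(grid)
n_combos = n_values ** n_thresholds

print(n_values)   # 12
print(n_combos)   # 5159780352
```

With roughly 5.2 billion combinations, each one triggering a row-wise apply and adding a new column, the grid size itself is likely the main cost, independent of how the if-elif chain is written.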

Here is a sample of the df:

import pandas as pd
data = {'Name_Raw':['AECOM TECHNICAL SERVICES', 'AECOM_*', 'AECOM- Amentum', 'AECOM GOVERNMENT SERVICES (Inactive)', 'ADT LLC dba ADT Security Services', 'ADT', 'AAA Call Center', 'AAA of Northern California, Nevada', 'ANHEUSER BUSCH InBev'], 'Name_CleanCorrect':['AECOM', 'AECOM', 'AECOM', 'AECOM', 'ADT SECURITY CORPORATION', 'ADT SECURITY CORPORATION', 'AAA', 'AAA', 'AB InBev'], 'Name_ngram':['AECOM', 'AECOM', 'AECOM', 'AECOM', 'ADT SECURITY CORPORATION', 'ADT SECURITY CORPORATION', 'AAA', 'State Bar of California', 'Ivanhoe Cambridge USA'], 'Score_ngrams':[38, 100, 51, 33, 52, 41, 36, 30, 16], 'Name_Fuzz':['AECOM', 'AECOM', 'AECOM', 'AECOM', 'ADT SECURITY CORPORATION', 'ADT SECURITY CORPORATION', 'AAA', 'State Bar of California', 'AB InBev'], 'Score_fuzz':[100, 100, 100, 100, 65, 85, 85, 37, 65], 'Name_jw':['Chicago Title Insuranc', 'Invesco', 'Heitman', 'Patheon/Thermo Fisher Scientific', 'Securitas Security Service', 'Michael Baker International, LLC', 'Bank of America', 'Ascension Health', 'Frontier Communication'], 'Score_jw':[66, 66, 63, 61, 62, 64, 67, 32, 100]}

df2 = pd.DataFrame(data)
print(df2)

My current code:

from itertools import product

def f(x, ngram_thresh, cosine_thresh, fuzz_thresh, fuzz_rat_thresh, fuzz_prat_thresh, jaro_thresh, jw_thresh, jaccard_thresh, lev_thresh):
    if x['Score_ngrams'] >= ngram_thresh : return x['Name_ngram']
    elif x['Score_cosine_words'] >= cosine_thresh : return x['Name_cosine_words']
    elif x['Score_fuzz'] >= fuzz_thresh : return x['Name_fuzz']
    elif x['Score_fuzz_ratio'] >= fuzz_rat_thresh : return x['Name_fuzz_ratio']
    elif x['Score_fuzz_pratio'] >= fuzz_prat_thresh : return x['Name_fuzz_pratio']
    elif x['Score_jaro'] >= jaro_thresh : return x['Name_jaro']
    elif x['Score_jw'] >= jw_thresh : return x['Name_jw']
    elif x['Score_jaccard'] >= jaccard_thresh : return x['Name_jaccard']
    elif x['Score_lev_r'] >= lev_thresh : return x['Name_lev_r']
    else: return 0

for ngram_t, cosine_t, fuzz_t, fuzz_rat_t, fuzz_prat_t, jaro_t, jw_t, jaccard_t, lev_t in product(range(50,110,5), repeat=9):
    df_fourth[f'Name_Clean_{ngram_t}_{cosine_t}_{fuzz_t}_{fuzz_rat_t}_{fuzz_prat_t}_{jaro_t}_{jw_t}_{jaccard_t}_{lev_t}'] = df_fourth.apply(f, ngram_thresh=ngram_t, cosine_thresh=cosine_t, fuzz_thresh=fuzz_t, fuzz_rat_thresh=fuzz_rat_t, fuzz_prat_thresh=fuzz_prat_t, jaro_thresh=jaro_t, jw_thresh=jw_t, jaccard_thresh=jaccard_t, lev_thresh=lev_t, axis=1)
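One possible direction, offered as a sketch rather than a definitive fix: replace the row-wise apply with numpy.select, which evaluates the same "first threshold that passes wins" cascade on whole columns, and keep only the accuracy per combination instead of storing a new column for each. The example assumes only the three scorers present in the sample frame (ngram, fuzz, jw) and uses four rows copied from it; the names pairs, accuracy, and results are hypothetical, and the empty string stands in for the original `return 0` fallback.

```python
from itertools import product

import numpy as np
import pandas as pd

# Four rows taken from the sample frame in the post (rows 1, 5, 6, 8).
df = pd.DataFrame({
    'Name_CleanCorrect': ['AECOM', 'ADT SECURITY CORPORATION', 'AAA', 'AB InBev'],
    'Name_ngram': ['AECOM', 'ADT SECURITY CORPORATION', 'AAA', 'Ivanhoe Cambridge USA'],
    'Score_ngrams': [100, 41, 36, 16],
    'Name_Fuzz': ['AECOM', 'ADT SECURITY CORPORATION', 'AAA', 'AB InBev'],
    'Score_fuzz': [100, 85, 85, 65],
    'Name_jw': ['Invesco', 'Michael Baker International, LLC',
                'Bank of America', 'Frontier Communication'],
    'Score_jw': [66, 64, 67, 100],
})

# (score column, name column) in the same priority order as the if-elif chain.
pairs = [('Score_ngrams', 'Name_ngram'),
         ('Score_fuzz', 'Name_Fuzz'),
         ('Score_jw', 'Name_jw')]

# Pull the columns out once so the loop works on plain NumPy arrays.
scores = [df[s].to_numpy() for s, _ in pairs]
names = [df[n].to_numpy(dtype=object) for _, n in pairs]
truth = df['Name_CleanCorrect'].to_numpy(dtype=object)

def accuracy(thresholds):
    """Share of rows whose selected name equals Name_CleanCorrect."""
    conds = [s >= t for s, t in zip(scores, thresholds)]
    # np.select takes the first condition that is True for each row,
    # which mirrors the if/elif cascade; '' is used when none passes.
    picked = np.select(conds, names, default='')
    return float((picked == truth).mean())

# Score every combination, storing one number per combination.
results = {combo: accuracy(combo)
           for combo in product(range(50, 110, 5), repeat=len(pairs))}
best = max(results, key=results.get)
print(best, results[best])
```

This removes the per-row Python overhead, but it does not shrink the search space. With nine scorers the full grid is still very large, so a coarser grid or a sampled subset of combinations may also be needed.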

Tags: name, return, score, jw, elif, cosine, jaro, jaccard
