Create a MultiIndex with from_product and append a column?

Posted 2024-09-26 04:55:48


I have the following data:

Input df:

fruit  uniqueid 
apple   1123
appless 321
banana  623
mango   739
mangos  889

Code:

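import pandas as pd
from fuzzywuzzy import fuzz  # assumption: fuzz is fuzzywuzzy's module; the post does not name the library

df.loc[:,'fruit_copy'] = df['fruit']
## compare every value in the column against every other value
compare = pd.MultiIndex.from_product([df['fruit'],df['fruit_copy']]).to_series()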

def metrics(tup):
    return pd.Series([fuzz.ratio(*tup),
                      fuzz.token_sort_ratio(*tup)],
                     ['ratio', 'token'])

compare = compare.apply(metrics)
## only keep higher matches
compare_80 = compare[(compare['ratio'] >=80) & (compare['token'] >=80)]

Current output:

                ratio   token
apple   apple   100     100
        appless 83      83
appless apple   83      83
        appless 100     100
banana  banana  100     100
mango   mango   100     100
        mangos  91      91
mangos  mango   91      91
        mangos  100     100

Expected output, first goal:

        index1  index2          ratio token uniqueid 
        apple   1123   apple    100   100   1123  
                       appless  83    83    321
        appless 321    apple    83    83    1123
                       appless  100   100   321
        banana  623    banana   100   100   623
        mango   739    mango    100   100   739
                       mangos   91    91    889
        mangos  889    mango    91    91    739
                       mangos   100   100   889

Expected output, second goal:

        index1  index2          ratio token uniqueid 
        apple   1123   appless  83    83    321  
        mango   739    mangos   91    91    889
        

Can I achieve this by appending uniqueid to the MultiIndex?


1 answer
User
#1 · Posted 2024-09-26 04:55:48

You can try doing this with a cross merge and then applying the fuzzy ratio:

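import numpy as np
import pandas as pd
from fuzzywuzzy import fuzz  # assumption: the post does not name the fuzzy-matching library

# blocking key: only pair fruits that share a prefix (assumes at least the first 2 characters match)
s = df['fruit'].str[:2]

# cross merge on a constant key k, restricted to rows with the same prefix s
# (axis is passed by keyword; the positional form drop('uniqueid',1) no longer works in pandas 2.x)
u = df.assign(k=1,s=s).merge(df.drop(columns='uniqueid').assign(k=1,s=s)
    ,on=['k','s'],suffixes=('','_y')).drop(columns=['k','s'])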

u = u[u['fruit'].ne(u['fruit_y'])].copy() #removing same combinations

u = (u.assign(Ratio=[fuzz.ratio(*i) for i in zip(u['fruit'],u['fruit_y'])])
       .sort_values('Ratio',ascending=False).drop_duplicates('fruit')).sort_index()

out = (u[pd.DataFrame(np.sort(u[['fruit','fruit_y']],axis=1),index=u.index)
      .duplicated(keep='last')])

print(out)

   fruit  uniqueid  fruit_y  Ratio
1  apple      1123  appless     83
6  mango       739   mangos     91
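
The answer above only produces the second goal. As a rough sketch of how both goals might be reached in one pass, the snippet below cross-merges the frame with itself so that uniqueid travels with each fruit, then moves the identifying columns into the index. It makes a few assumptions that are not in the original post: pandas 1.2 or later (for how="cross"), and two hypothetical helpers, ratio and token_sort_ratio, built on the standard library's difflib as stand-ins for fuzz.ratio and fuzz.token_sort_ratio so the example runs without a fuzzy-matching package. Scores from the real fuzz functions may differ slightly on other data.

```python
import difflib
import pandas as pd

df = pd.DataFrame({
    "fruit": ["apple", "appless", "banana", "mango", "mangos"],
    "uniqueid": [1123, 321, 623, 739, 889],
})

def ratio(a, b):
    # Hypothetical stand-in for fuzz.ratio: difflib similarity scaled to 0-100.
    return round(100 * difflib.SequenceMatcher(None, a, b).ratio())

def token_sort_ratio(a, b):
    # Hypothetical stand-in for fuzz.token_sort_ratio: sort the words, then compare.
    return ratio(" ".join(sorted(a.split())), " ".join(sorted(b.split())))

# Every (row, row) combination; columns from the right-hand copy get the "_y" suffix.
pairs = df.merge(df, how="cross", suffixes=("", "_y"))
pairs["ratio"] = [ratio(a, b) for a, b in zip(pairs["fruit"], pairs["fruit_y"])]
pairs["token"] = [token_sort_ratio(a, b)
                  for a, b in zip(pairs["fruit"], pairs["fruit_y"])]

# First goal: keep strong matches and index by (fruit, uniqueid, matched fruit).
goal1 = (pairs[(pairs["ratio"] >= 80) & (pairs["token"] >= 80)]
         .set_index(["fruit", "uniqueid", "fruit_y"]))

# Second goal: drop self-matches and mirrored pairs by keeping fruit < fruit_y.
left = goal1.index.get_level_values("fruit")
right = goal1.index.get_level_values("fruit_y")
goal2 = goal1[left < right]

print(goal1)
print(goal2)
```

Here uniqueid_y plays the role of the trailing uniqueid column in the expected output. Unlike the answer above, this sketch compares every pair rather than only those sharing a two-character prefix, which is simpler but grows quadratically with the number of rows.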
