I have been trying to apply the fuzzywuzzy package to a problem of finding fraudulent entries. How can I apply it to the problem below?

Posted 2024-10-02 22:29:21


I want to use the fuzzywuzzy package on the table below:

x   Reference   amount
121 TOR1234        500
121 T0R1234        500
121 W7QWER         500
121 W1QWER         500
141 TRYCATC        700
141 TRYCATC        700
151 I678MKV        300
151 1678MKV        300
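  1. I want to group the rows of the table where columns "x" and "amount" match.
  2. For each reference in a group:
     i. Compare it (using fuzzywuzzy) with the other references in that group.
        a. If the match is 100%, delete them.
        b. If the match is 90-99.99%, keep them.
        c. Delete any row whose match is below 90%.

Expected output: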
x   Reference   amount
151 I678MKV        300
151 1678MKV        300
121 TOR1234        500
121 T0R1234        500
121 W7QWER         500
121 W1QWER         500

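This is meant to detect fraudulent entries, as in the table above, where "1" has been replaced with "I" and "0" with "O". If you have a different solution, please suggest it.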


1 answer
User
#1 · Posted 2024-10-02 22:29:21

As far as I can tell, you do not need the fuzzywuzzy package for this. A simple drop_duplicates with keep=False removes the exact (100%) matches:

import pandas as pd

df = pd.DataFrame(data={"x":[121,121,121,121,141,141,151,151],
                   "Refrence":["TOR1234","T0R1234","W7QWER","W1QWER","TRYCATC","TRYCATC"
                               ,"I678MKV","1678MKV"],
                   "amount":[500,500,500,500,700,700,300,300]})
# keep=False drops every copy of a duplicated row, so both TRYCATC rows are removed
res = df.drop_duplicates(['x','Refrence','amount'],keep=False).sort_values(['x'],ascending=[False])

print(res)
     x Refrence  amount
6  151  I678MKV     300
7  151  1678MKV     300
0  121  TOR1234     500
1  121  T0R1234     500
2  121   W7QWER     500
3  121   W1QWER     500
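
What keep=False does can be sketched without pandas: every row whose (x, Refrence, amount) combination occurs more than once is dropped, including its first occurrence. The snippet below is a plain-Python stand-in built on collections.Counter, not the pandas implementation, and the helper name drop_all_duplicates is made up for illustration.

```python
from collections import Counter

# Same rows as the DataFrame above, as (x, Refrence, amount) tuples.
rows = [
    (121, "TOR1234", 500), (121, "T0R1234", 500),
    (121, "W7QWER", 500), (121, "W1QWER", 500),
    (141, "TRYCATC", 700), (141, "TRYCATC", 700),
    (151, "I678MKV", 300), (151, "1678MKV", 300),
]

def drop_all_duplicates(records):
    """Keep only records that occur exactly once (hypothetical helper)."""
    counts = Counter(records)
    return [record for record in records if counts[record] == 1]

unique_rows = drop_all_duplicates(rows)
for row in unique_rows:
    print(row)
```

Both TRYCATC rows disappear, which corresponds to the "100% match, delete them" rule in the question; the six remaining rows are the ones that still need a similarity check.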

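Then apply an edit distance (Levenshtein style; the code below uses the Damerau variant) to the references that share the same x: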

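from itertools import combinations
# Damerau comes from the python-string-similarity package; this import path
# matches older releases. Newer releases are, to my knowledge, published as
# strsimpy (from strsimpy.damerau import Damerau).
from similarity.damerau import Damerau
damerau = Damerau()

# Every unordered pair of the references that survived drop_duplicates
data = list(combinations(res['Refrence'], 2))

refrence_df = pd.DataFrame(data,columns=['Refrence','Refrence2'])

# Look up x for both sides of each pair
refrence_df = pd.merge(refrence_df,df[['x','Refrence']],on=['Refrence'],how='left')
refrence_df = pd.merge(refrence_df,df[['x','Refrence']],left_on=['Refrence2'],right_on=['Refrence'],how='left')

refrence_df.rename(columns={'x_x':'x_1','x_y':'x_2','Refrence_x':'Refrence'},inplace=True)

refrence_df.drop(['Refrence_y'],axis=1,inplace=True)

# Keep only pairs from the same x; .copy() avoids SettingWithCopyWarning below
refrence_df = refrence_df[refrence_df['x_1']==refrence_df['x_2']].copy()

# Number of edits needed to turn one reference into the other
refrence_df['edit_required'] = refrence_df.apply(lambda x: damerau.distance(x['Refrence'],x['Refrence2']),
                                                   axis=1)

# Characters that occur in Refrence but not in Refrence2
refrence_df['characters_not_common'] = refrence_df.apply(lambda x :list(set(x['Refrence'])-set(x['Refrence2'])),axis=1)
print(refrence_df)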
    Refrence Refrence2  x_1  x_2  edit_required characters_not_common
0   I678MKV   1678MKV  151  151              1                   [I]
9   TOR1234   T0R1234  121  121              1                   [O]
10  TOR1234    W7QWER  121  121              7    [O, T, 1, 3, 2, 4]
11  TOR1234    W1QWER  121  121              7       [O, T, 3, 2, 4]
12  T0R1234    W7QWER  121  121              7    [T, 1, 0, 3, 2, 4]
13  T0R1234    W1QWER  121  121              7       [T, 0, 3, 2, 4]
14   W7QWER    W1QWER  121  121              1                   [7]
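
To get from this pair table to the rows the question expects, one option is to keep only references that have a near match inside their own group. The sketch below is plain Python and rests on assumptions that are not part of the answer above: groups are keyed on (x, amount), "near" means a plain Levenshtein distance of at most 1, and levenshtein and suspicious_rows are illustrative helper names. A distance cutoff is used instead of a percentage because a single substitution in a 7-character reference gives 1 - 1/7, roughly 85.7%, under a length-normalised score, which would fall below the 90% cutoff from the question.

```python
from collections import Counter, defaultdict
from itertools import combinations

def levenshtein(a, b):
    """Plain Levenshtein distance (insert, delete, substitute), two-row DP."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]

def suspicious_rows(records, max_distance=1):
    """Return rows whose reference is within max_distance edits of another
    reference in the same (x, amount) group. Hypothetical helper."""
    counts = Counter(records)
    groups = defaultdict(list)
    for x, ref, amount in records:
        if counts[(x, ref, amount)] == 1:  # exact duplicates are dropped entirely
            groups[(x, amount)].append(ref)
    flagged = []
    for (x, amount), refs in groups.items():
        near = set()
        for a, b in combinations(refs, 2):
            if levenshtein(a, b) <= max_distance:
                near.update((a, b))
        flagged.extend((x, ref, amount) for ref in refs if ref in near)
    return flagged

rows = [
    (121, "TOR1234", 500), (121, "T0R1234", 500),
    (121, "W7QWER", 500), (121, "W1QWER", 500),
    (141, "TRYCATC", 700), (141, "TRYCATC", 700),
    (151, "I678MKV", 300), (151, "1678MKV", 300),
]

result = suspicious_rows(rows)
for row in result:
    print(row)
```

On the sample data this keeps the six rows from the expected output and drops both TRYCATC rows. Raising max_distance loosens the check, at the cost of flagging references that merely look alike.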
