Row pairs with the highest string similarity

Published 2024-06-26 17:49:04

I have a DataFrame:

import pandas as pd
d = {'id': [1,1,1,1,2,2,3,3,3,4,4,4,4],
     'name':['ada','aad','ada','ada','dddd','fdd','ccc','cccd','ood','aaa','aaa','aar','rrp']
    ,'amount':[2,-12,12,-12,5,-5,2,3,-5,3,-10,10,-10]}
df1 = pd.DataFrame(d)
df1
    id  name  amount
0   1   ada   2
1   1   aad   -12
2   1   ada   12
3   1   ada   -12
4   2   dddd  5
5   2   fdd   -5
6   3   ccc   2
7   3   cccd  3
8   3   ood   -5
9   4   aaa   3
10  4   aaa   -10
11  4   aar   10
12  4   rrp   -10

First, within each id, I want to keep only the amounts that have a matching amount of the opposite sign. I did that as follows:

def match_pos_neg(df):
    return df[df["amount"].isin(-df["amount"])]

df1 = df1.groupby("id").apply(match_pos_neg).reset_index(0, drop=True)
df1
    id  name  amount
1   1   aad   -12
2   1   ada   12
3   1   ada   -12
4   2   dddd  5
5   2   fdd   -5
10  4   aaa   -10
11  4   aar   10
12  4   rrp   -10

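The next step is to keep only the matched positive/negative pair whose values in the string column "name" are most similar. So if an id has two negative amounts that match one positive amount, I want to isolate the pair with the highest similarity for that id. The desired output is therefore: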

    id  name  amount
2   1   ada   12
3   1   ada   -12
4   2   dddd  5
5   2   fdd   -5
10  4   aaa   -10
11  4   aar   10

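I think I need some kind of string similarity measure, such as SequenceMatcher or Jaccard, but I am not sure how to approach this. Any help getting the desired output would be much appreciated.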


1 answer
User
#1 · Posted 2024-06-26 17:49:04

You can try the following:

Note that you can change what gets printed by editing the return value of the function create_sim.

import pandas as pd
from operator import itemgetter

d = {'id': [1,1,1,1,2,2,3,3,3,4,4,4,4],
     'name':['ada','aad','ada','ada','dddd','fdd','ccc','cccd','ood','aaa','aaa','aar','rrp']
    ,'amount':[2,-12,12,-12,5,-5,2,3,-5,3,-10,10,-10]}
df1 = pd.DataFrame(d)

def match_pos_neg(df):
    return df[df["amount"].isin(-df["amount"])]

df1 = df1.groupby("id").apply(match_pos_neg).reset_index(0, drop=True)

print(df1)


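def split(word):
    # Break a string into its individual characters.
    return list(word)


def DistJaccard(str1, str2):
    # Despite the name, this returns the Jaccard *similarity* of the two
    # character sets (1.0 means identical sets), not a distance.
    l1 = set(split(str1))
    l2 = set(split(str2))
    return float(len(l1 & l2)) / len(l1 | l2)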


def create_sim(df, idx):
    idx_id = df['id'].values[idx]
    idx_amount = df['amount'].values[idx]
    idx_name = df['name'].values[idx]
    df_t = df.loc[df['id'] == idx_id]
    pos = [i for i in list(df_t['amount']) if i > 0] or None
    neg = [i for i in list(df_t['amount']) if i < 0] or None
    if pos and neg:
        l = [x for x in list(df_t['amount']) if x == idx_amount * -1]
        if len(l) > 0:
            # Look for the opposite amount within the same id only (df_t),
            # not across the whole frame.
            df_t = df_t.loc[df_t['amount'] == idx_amount * -1]
            compare_list = list(df_t['name'])
            list_results = []
            for item in compare_list:
                sim = DistJaccard(idx_name, item)
                list_results.append((item, sim))
            return max(list_results, key=itemgetter(1))
    return None

count = 0
for index, row in df1.iterrows():
    res = create_sim(df1, count)
    if res:
        print(f"The most similar word of {row['name']} is {res[0]} with similarity of {res[1]}")
    else:
        print(f"No similar words of {row['name']}")
    count+=1
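
One thing to watch with a Jaccard score over character sets: it ignores both order and repetition, so 'ada' and 'aad' look identical to it. The sketch below compares it with `difflib.SequenceMatcher` from the standard library, which is order-aware. The helper names `jaccard_chars` and `seq_ratio` are illustrative only and are not part of the answer above.

```python
from difflib import SequenceMatcher


def jaccard_chars(a, b):
    # Jaccard similarity over the sets of characters
    # (order and repeated letters are ignored).
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb)


def seq_ratio(a, b):
    # Order-aware similarity in [0, 1] from the standard library.
    return SequenceMatcher(None, a, b).ratio()


for other in ("aad", "ada"):
    print(other, jaccard_chars("ada", other), round(seq_ratio("ada", other), 3))
```

With the set-based score both candidates tie at 1.0 for 'ada', so `max` simply returns whichever comes first; an order-aware ratio is one way to break that tie.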

Edit:

To build a DataFrame from the results, you can change it to:

count = 0
item1_id = []
item1_row = []
item1_name = []
item2_id = []
item2_row = []
item2_name = []
for index, row in df1.iterrows():
    res = create_sim(df1, count)
    item1_id.append(row['id'])
    # Record the index label (not the position) so it matches 'item 2 row'.
    item1_row.append(index)
    item1_name.append(row['name'])
    if res:
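        # create_sim returns (name, similarity), so res[2] does not exist;
        # the partner is looked up within the current row's id instead.
        row_idx = df1.loc[(df1['id'] == row['id']) & (df1['name'] == res[0]) & (df1['amount'] == -row['amount']), "name"].index.tolist()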
        item2_id.append(row['id'])
        item2_row.append(row_idx[0])
        item2_name.append(res[0])
    else:
        item2_id.append(None)
        item2_row.append(None)
        item2_name.append(None)
    count+=1


final = pd.DataFrame(item1_id, columns=['item 1 id'])
final['item 1 row'] = item1_row
final['item 1 name'] = item1_name
final['item 2 id'] = item2_id
final['item 2 row'] = item2_row
final['item 2 name'] = item2_name

print(final)
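
To get from per-row matches to the table the question asks for, one possible approach is to score every positive/negative pair with cancelling amounts inside each id and keep only the best-scoring pair. The following is a minimal sketch under two assumptions: `difflib.SequenceMatcher` is used as the similarity measure (the question mentions it as an option), and ties are broken by first occurrence. The function name `best_pair_labels` is hypothetical.

```python
from difflib import SequenceMatcher

import pandas as pd

d = {'id': [1, 1, 1, 1, 2, 2, 3, 3, 3, 4, 4, 4, 4],
     'name': ['ada', 'aad', 'ada', 'ada', 'dddd', 'fdd', 'ccc', 'cccd',
              'ood', 'aaa', 'aaa', 'aar', 'rrp'],
     'amount': [2, -12, 12, -12, 5, -5, 2, 3, -5, 3, -10, 10, -10]}
df = pd.DataFrame(d)


def best_pair_labels(group):
    # Return the index labels of the positive/negative pair whose amounts
    # cancel and whose names are most similar; an empty list if no pair exists.
    best_score, best_labels = -1.0, []
    pos = group[group["amount"] > 0]
    neg = group[group["amount"] < 0]
    for p_label, p in pos.iterrows():
        candidates = neg[neg["amount"] == -p["amount"]]
        for n_label, n in candidates.iterrows():
            score = SequenceMatcher(None, p["name"], n["name"]).ratio()
            if score > best_score:  # strict '>' keeps the first pair on ties
                best_score, best_labels = score, [p_label, n_label]
    return best_labels


keep = []
for _, group in df.groupby("id"):
    keep.extend(best_pair_labels(group))

result = df.loc[sorted(keep)]
print(result)
```

On the sample data this keeps rows 2, 3, 4, 5, 10 and 11, which is the desired output from the question. Looping over the groups explicitly, rather than using `groupby(...).apply`, is a design choice made here to keep the original index labels without any index reshaping.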
