Grouping a DataFrame conditionally on a similarity measure between two strings

Posted 2024-09-30 10:29:27

I want to group a DataFrame by the "code" column, but only when the values in "name" are clearly different from each other.

import pandas as pd

d = {'code': ['ABC', 'ABC','DB','DB','CDP'], 'name': ['abcde','abc de', 'defs','wokj','lkj']}
df = pd.DataFrame(data=d)
print(df)

  code    name
0  ABC   abcde
1  ABC  abc de
2   DB    defs
3   DB    wokj
4  CDP     lkj

A plain groupby would look like this:

df2 = df.groupby('code').agg(name=('name', ' + '.join)).reset_index()
print(df2)

  code            name
0  ABC  abcde + abc de
1  CDP             lkj
2   DB     defs + wokj

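But the ABC rows should not be grouped. They should stay as separate rows, because their names are similar according to a condition like the following: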

from difflib import SequenceMatcher

def similar(a, b):
    return SequenceMatcher(None, a, b).ratio()

print(similar('abcde', 'abc de'))
print(similar('defs', 'wokj'))

0.9090909090909091
0.0

The final result I want is:

  code         name
0  ABC        abcde
1  ABC       abc de
2  CDP          lkj
3   DB  defs + wokj

How can I set such a condition inside groupby?


1 answer
User
#1 · Posted 2024-09-30 10:29:27

This may not be a great solution, but I hope it works for you. Some parts could be made more Pythonic.

import numpy as np
import pandas as pd
from difflib import SequenceMatcher

def similar(dfg):
    # A single-row group has nothing to compare, so return it unchanged.
    if len(dfg) < 2:
        return dfg[['code', 'name']]

    # Cross join the group with itself to get every pair of names.
    pairs = dfg.assign(a=1).merge(dfg[['name']].assign(a=1), on='a')
    pairs = pairs[pairs['name_x'] != pairs['name_y']].copy()
    # Sort each pair so that (a, b) and (b, a) become duplicates, then drop them.
    pairs[['name_x', 'name_y']] = np.sort(pairs[['name_x', 'name_y']].to_numpy(), axis=1)
    pairs = pairs.drop_duplicates(subset=['name_x', 'name_y'])
    pairs = pairs.assign(sim=[SequenceMatcher(None, x, y).ratio()
                              for x, y in zip(pairs['name_x'], pairs['name_y'])])

    # DataFrame.append was removed in pandas 2.0, so collect the rows in lists.
    index, rows = [], []
    for idx, row in pairs.iterrows():
        if row['sim'] > 0:
            # Names that share anything stay on separate rows.
            names = [row['name_x'], row['name_y']]
        else:
            # Completely different names are joined into one row.
            names = [row['name_x'] + ' + ' + row['name_y']]
        for name in names:
            index.append(idx)
            rows.append({'code': row['code'], 'name': name})

    return pd.DataFrame(rows, index=index, columns=['code', 'name'])

d = {'code': ['ABC', 'ABC', 'ABC', 'DB','DB','CDP'], 'name': ['abcde','abc de', 'xyz', 'defs','wokj','lkj']}
df = pd.DataFrame(data=d)
print(df)

# Iterate over the groups explicitly so that each group still contains its 'code' column.
df2 = pd.concat({code: similar(g) for code, g in df.groupby('code')}, names=['code'])
print(df2)

Input (the data from the question):

  code    name
0  ABC   abcde
1  ABC  abc de
2   DB    defs
3   DB    wokj
4  CDP     lkj

Output:

       code         name
code                    
ABC  1  ABC       abc de
     1  ABC        abcde
CDP  4  CDP          lkj
DB   1   DB  defs + wokj

Input (with an extra 'xyz' row under ABC):

  code    name
0  ABC   abcde
1  ABC  abc de
2  ABC     xyz
3   DB    defs
4   DB    wokj
5  CDP     lkj

Output:

       code          name
code                     
ABC  1  ABC        abc de
     1  ABC         abcde
     2  ABC   abcde + xyz
     5  ABC  abc de + xyz
CDP  5  CDP           lkj
DB   1   DB   defs + wokj
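
In the second output, 'abcde' and 'abc de' appear both on their own rows and joined with 'xyz', which is probably not what you want. Below is a rough sketch of one possible variant, not a definitive fix. It assumes the intended rule is: a name that is similar to at least one other name in its group stays on its own row, and the remaining names of the group are joined. The helper names `ratio` and `split_or_join` and the 0.5 cut-off are my own assumptions, so adjust them to your data.

```python
from difflib import SequenceMatcher
from itertools import combinations

import pandas as pd

THRESHOLD = 0.5  # assumed cut-off; the code above effectively uses "> 0"


def ratio(a, b):
    """Similarity of two strings in [0, 1], as in the question."""
    return SequenceMatcher(None, a, b).ratio()


def split_or_join(names, threshold=THRESHOLD):
    """Return the list of output names for one group (hypothetical helper)."""
    names = list(names)
    # Collect every name that resembles at least one other name in the group.
    similar = set()
    for a, b in combinations(names, 2):
        if ratio(a, b) > threshold:
            similar.update((a, b))
    separate = [n for n in names if n in similar]
    rest = [n for n in names if n not in similar]
    # Similar names stay separate; the remaining names are joined into one row.
    return separate + ([' + '.join(rest)] if rest else [])


df = pd.DataFrame({
    'code': ['ABC', 'ABC', 'ABC', 'DB', 'DB', 'CDP'],
    'name': ['abcde', 'abc de', 'xyz', 'defs', 'wokj', 'lkj'],
})

result = (
    df.groupby('code')['name']
      .agg(split_or_join)  # one list of output names per code
      .explode()           # one row per output name
      .reset_index()
)
print(result)
```

With the data above, this keeps 'abcde' and 'abc de' as separate ABC rows, leaves 'xyz' and 'lkj' alone, and joins the DB names into 'defs + wokj'. Whether "similar to any other name" is the right rule depends on your data, so treat the threshold and the rule as starting points.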
