Merging DataFrames using "contains" (not an exact match!)

3 answers

User

Answer 1 · edited 2024-10-03 23:21:34

You can try the following, which uses str.extract and join:

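import pandas as pd

# Map each Account in df1 to its [ID, Name] list
d = df1.set_index('Account').agg(list, axis=1).to_dict()
# Alternation pattern built from df1's accounts, e.g. '(B36363|G47281|H46291)'
p = '({})'.format('|'.join(df1.Account))
# Extract the embedded account from df2.Account, then look up its ID and Name
m = pd.DataFrame(df2.Account.str.extract(p, expand=False).map(d).fillna('').tolist(),
                 columns=['ID', 'Name'], index=df2.index)
df2.join(m)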

     Account    Col_B    Col_C               ID        Name
1   B36363-0  text_b1  text_c1          2019001        John
2  01_G47281  text_b2  text_c2  2019002;2018101  Alice;Emma
3   X_H46291  text_b3  text_c3          2019001        John
4  II_G47281  text_b4  text_C4  2019002;2018101  Alice;Emma
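
The question's input data is not reproduced on this page, so the following is a self-contained sketch with df1 and df2 reconstructed from the output above (an assumption, not the asker's actual frames). It follows the same idea of extracting the embedded account and looking it up, but uses join(on=...) instead of the dictionary, and adds re.escape so that accounts containing regex metacharacters are matched literally. The names pattern, key, lookup, and result are illustrative only.

```python
import re
import pandas as pd

# Sample frames reconstructed from the output shown above (an assumption;
# the original question's data is not included on this page).
df1 = pd.DataFrame({
    'Account': ['B36363', 'G47281', 'H46291'],
    'ID': ['2019001', '2019002;2018101', '2019001'],
    'Name': ['John', 'Alice;Emma', 'John'],
})
df2 = pd.DataFrame({
    'Account': ['B36363-0', '01_G47281', 'X_H46291', 'II_G47281'],
    'Col_B': ['text_b1', 'text_b2', 'text_b3', 'text_b4'],
    'Col_C': ['text_c1', 'text_c2', 'text_c3', 'text_C4'],
}, index=[1, 2, 3, 4])

# Escape each account so characters such as '.' or '+' are matched literally.
pattern = '({})'.format('|'.join(map(re.escape, df1['Account'])))

# Pull the embedded account out of df2.Account.
key = df2['Account'].str.extract(pattern, expand=False)

# Look the extracted account up in df1, keeping df2's index and row order.
lookup = df1.set_index('Account')
result = df2.assign(key=key).join(lookup, on='key').drop(columns='key')
print(result)
```

Rows of df2 with no matching account would get NaN in ID and Name rather than being dropped.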

User

Answer 2 · edited 2024-10-03 23:21:34

Use my fuzzy_merge function:

from fuzzywuzzy import fuzz
from fuzzywuzzy import process

df3 = fuzzy_merge(df2, df1, 'Account', 'Account', threshold=80)\
     .merge(df1, left_on='matches', right_on='Account', suffixes=['', '_2'])\
     .drop(columns=['matches', 'Account_2'])

Output

     Account    Col_B    Col_C               ID        Name
0   B36363-0  text_b1  text_c1          2019001        John
1  01_G47281  text_b2  text_c2  2019002;2018101  Alice;Emma
2  II_G47281  text_b4  text_C4  2019002;2018101  Alice;Emma
3   X_H46291  text_b3  text_c3          2019001        John

The fuzzy_merge function from the linked answer:

def fuzzy_merge(df_1, df_2, key1, key2, threshold=90, limit=2):
    """
    df_1 is the left table to join
    df_2 is the right table to join
    key1 is the key column of the left table
    key2 is the key column of the right table
    threshold is how close the matches should be to return a match
    limit is the number of matches returned, sorted from high to low
    """
    s = df_2[key2].tolist()

    m = df_1[key1].apply(lambda x: process.extract(x, s, limit=limit))    
    df_1['matches'] = m

    m2 = df_1['matches'].apply(lambda x: ', '.join([i[0] for i in x if i[1] >= threshold]))
    df_1['matches'] = m2

    return df_1
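
For readers who cannot install fuzzywuzzy, here is a rough sketch of the same match-then-merge idea using difflib from the Python standard library. This is a swapped-in technique, not the answer's method: difflib scores on a 0 to 1 scale and will not rank candidates the same way as fuzzywuzzy, and the cutoff of 0.7 is an assumption chosen for this sample data. The helper closest_account and the sample frames are hypothetical.

```python
import difflib
import pandas as pd

# Hypothetical sample data modelled on the output shown above.
df1 = pd.DataFrame({'Account': ['B36363', 'G47281', 'H46291'],
                    'Name': ['John', 'Alice;Emma', 'John']})
df2 = pd.DataFrame({'Account': ['B36363-0', '01_G47281', 'X_H46291', 'II_G47281']})


def closest_account(value, choices, cutoff=0.7):
    """Return the single best difflib match for value in choices, or None."""
    hits = difflib.get_close_matches(value, choices, n=1, cutoff=cutoff)
    return hits[0] if hits else None


choices = df1['Account'].tolist()

# Work on a copy via assign() so df2 itself is not modified.
matched = df2.assign(
    matches=df2['Account'].apply(lambda v: closest_account(v, choices))
)

# Merge on the matched key, then drop the helper columns.
fuzzy_result = (
    matched.merge(df1, left_on='matches', right_on='Account',
                  how='left', suffixes=('', '_2'))
           .drop(columns=['matches', 'Account_2'])
)
print(fuzzy_result)
```

Keeping a single best match per row avoids a comma-joined key in the matches column, which would not merge against df1.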

User

Answer 3 · edited 2024-10-03 23:21:34

Try str.extract on df2.Account, set the result as the index of df2, and join:

pat1 = '('+'|'.join(df1.Account)+')'
s = df2.Account.str.extract(pat1, expand=False)
df2.set_index(s).join(df1.set_index('Account')).reset_index(drop=True)

Out[644]:
     Account    Col_B    Col_C               ID        Name
0   B36363-0  text_b1  text_c1          2019001        John
1  01_G47281  text_b2  text_c2  2019002;2018101  Alice;Emma
2  II_G47281  text_b4  text_C4  2019002;2018101  Alice;Emma
3   X_H46291  text_b3  text_c3          2019001        John

Another way is to use merge:

# Passing axis positionally to drop() no longer works in pandas 2.0+; use columns=
df2.assign(Account2=df2.Account.str.extract(pat1, expand=False)) \
   .merge(df1, left_on='Account2', right_on='Account', suffixes=('', 'y')) \
   .drop(columns=['Account2', 'Accounty'])

Out[645]:
     Account    Col_B    Col_C               ID        Name
0   B36363-0  text_b1  text_c1          2019001        John
1  01_G47281  text_b2  text_c2  2019002;2018101  Alice;Emma
2  II_G47281  text_b4  text_C4  2019002;2018101  Alice;Emma
3   X_H46291  text_b3  text_c3          2019001        John
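
The merge above is an inner join by default, so any df2 row whose Account contains none of the df1 accounts is dropped. If such rows should be kept, one option is a left merge with indicator=True, sketched below. The sample frames are hypothetical, and the row 'Z_000000' is invented purely to show an unmatched case.

```python
import re
import pandas as pd

# Hypothetical sample data; 'Z_000000' deliberately matches nothing in df1.
df1 = pd.DataFrame({'Account': ['B36363', 'G47281'],
                    'Name': ['John', 'Alice;Emma']})
df2 = pd.DataFrame({'Account': ['B36363-0', '01_G47281', 'Z_000000'],
                    'Col_B': ['text_b1', 'text_b2', 'text_b3']})

# Escaped alternation pattern, e.g. '(B36363|G47281)'.
pat = '(' + '|'.join(re.escape(a) for a in df1['Account']) + ')'

out = (
    df2.assign(Account2=df2['Account'].str.extract(pat, expand=False))
       # Rename df1's key so both sides share one merge column.
       .merge(df1.rename(columns={'Account': 'Account2'}),
              on='Account2', how='left', indicator=True)
       .drop(columns='Account2')
)
print(out)
```

The _merge column reads 'both' for matched rows and 'left_only' for rows that found no account, which makes the unmatched rows easy to filter or inspect.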
