Merge two DataFrames in Python based on whether comma-split elements are substrings of another column

                        0                     1
0               Chemicals  Electrical Equipment
1  Information Technology   Software & Services
2                 Tobacco             Beverages
3         Pharmaceuticals           Health Care
4              Technology                  None
5             Oil and Gas                Energy
6              Technology         Food Products
7           Manufacturing                  None

    company_name  ...                       label
0   Looney Tunes  ...  second tiers, second tiers
1   The Simpsons  ...                 third tiers
2  Soylent Green  ...                 first tiers
3        Initech  ...                         NaN
4  Resident Evil  ...                 third tiers
5          Hooli  ...                second tiers
6          Weeds  ...    third tiers, first tiers
7         Fringe  ...                second tiers

1 answer

User

Answer #1 · posted 2024-09-26 18:02:24

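First split the comma-separated keywords in both frames and expand them to one row per keyword (the code below uses assign, str.split and explode). Then convert the index to a column with reset_index before the merge so the original index values are not lost, aggregate the matched labels with join, and add the result back to the original df1: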

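# Split the comma-separated keywords into lists, then explode to one row per keyword
df11 = df1.assign(industry_keywords = df1['industry_keywords'].str.split(', ')).explode('industry_keywords')
df22 = df2.assign(industry_keywords = df2['industry_keywords'].str.split(', ')).explode('industry_keywords')

# reset_index keeps the original row labels of df1 in an 'index' column,
# so the matched labels can be grouped per original row after the merge
s = (df11.reset_index()
         .merge(df22, on='industry_keywords')
         .groupby('index')['label']
         .agg(', '.join))

# Align the aggregated labels back to df1 by index; rows without a match get NaN
df1 = df1.join(s)
print(df1)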

   company_name                             industry_keywords  \
0   Looney Tunes              Chemicals, Electrical Equipment   
1   The Simpsons  Information Technology, Software & Services   
2  Soylent Green                           Tobacco, Beverages   
3        Initech                 Pharmaceuticals, Health Care   
4  Resident Evil                                   Technology   
5          Hooli                          Oil and Gas, Energy   
6          Weeds                    Technology, Food Products   
7         Fringe                                Manufacturing   

                        label  
0  second tiers, second tiers  
1                 third tiers  
2                 first tiers  
3                         NaN  
4                 third tiers  
5                second tiers  
6    third tiers, first tiers  
7                second tiers
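The thread does not show how df1 and df2 are built, so the steps above can only be sketched end to end with made-up inputs. In the sketch below, the contents of both frames and the helper name explode_keywords are assumptions for illustration; only the column names and the chain of calls come from the answer.

```python
import pandas as pd

# Hypothetical sample data: the original df2 is not shown in the thread.
df1 = pd.DataFrame({
    "company_name": ["Looney Tunes", "Initech", "Weeds"],
    "industry_keywords": [
        "Chemicals, Electrical Equipment",
        "Pharmaceuticals, Health Care",
        "Technology, Food Products",
    ],
})
df2 = pd.DataFrame({
    "industry_keywords": [
        "Chemicals, Electrical Equipment",
        "Technology",
        "Food Products, Beverages",
    ],
    "label": ["second tiers", "third tiers", "first tiers"],
})


def explode_keywords(df):
    """Return a copy with one row per comma-separated keyword (hypothetical helper)."""
    return df.assign(
        industry_keywords=df["industry_keywords"].str.split(", ")
    ).explode("industry_keywords")


df11 = explode_keywords(df1)
df22 = explode_keywords(df2)

# reset_index stores df1's row labels in an 'index' column before the merge,
# so the labels can be grouped back per original row afterwards.
labels = (
    df11.reset_index()
        .merge(df22, on="industry_keywords")
        .groupby("index")["label"]
        .agg(", ".join)
)

# Rows of df1 without any matching keyword (Initech here) end up with NaN.
result = df1.join(labels)
print(result)
```

Because the merge is an inner join on exact keyword equality, a keyword only matches when both frames contain the identical string after splitting on ', '.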

To remove duplicate labels, use drop_duplicates:

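# Run this on the original df1: if df1 already has a 'label' column from the
# previous step, join raises a ValueError because the column names overlap.
# drop_duplicates keeps one row per (original row, label) pair before aggregating
s = (df11.reset_index()
         .merge(df22, on='industry_keywords')
         .drop_duplicates(['index','label'])
         .groupby('index')['label']
         .agg(', '.join))

df1 = df1.join(s)
print(df1)
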
   company_name                             industry_keywords  \
0   Looney Tunes              Chemicals, Electrical Equipment   
1   The Simpsons  Information Technology, Software & Services   
2  Soylent Green                           Tobacco, Beverages   
3        Initech                 Pharmaceuticals, Health Care   
4  Resident Evil                                   Technology   
5          Hooli                          Oil and Gas, Energy   
6          Weeds                    Technology, Food Products   
7         Fringe                                Manufacturing   

                      label  
0              second tiers  
1               third tiers  
2               first tiers  
3                       NaN  
4               third tiers  
5              second tiers  
6  third tiers, first tiers  
7              second tiers
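To see what the deduplication step does in isolation, here is a minimal sketch on a small hand-written frame. The frame merged is hypothetical and stands in for the result of the merge above; the second variant, which deduplicates inside the aggregation with dict.fromkeys, is an alternative that is not part of the original answer.

```python
import pandas as pd

# Hypothetical merged frame: one row per (original row, matched keyword).
merged = pd.DataFrame({
    "index": [0, 0, 6, 6],
    "industry_keywords": [
        "Chemicals", "Electrical Equipment", "Technology", "Food Products",
    ],
    "label": ["second tiers", "second tiers", "third tiers", "first tiers"],
})

# Variant from the answer: drop repeated (index, label) pairs, then join.
deduped = (
    merged.drop_duplicates(["index", "label"])
          .groupby("index")["label"]
          .agg(", ".join)
)

# Alternative sketch: deduplicate inside the aggregation.
# dict.fromkeys keeps the first occurrence of each label in order.
alt = merged.groupby("index")["label"].agg(
    lambda labels: ", ".join(dict.fromkeys(labels))
)

print(deduped)
print(alt)
```

Both variants keep the first occurrence of each label per original row, so the order of the remaining labels follows the order of the keywords.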
