Merge two DataFrames based on whether comma-split elements are substrings of another column in Python

Posted 2024-09-26 18:02:24


Given a DataFrame df1 as follows:

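    company_name                            industry_keywords
0   Looney Tunes              Chemicals, Electrical Equipment
1   The Simpsons  Information Technology, Software & Services
2  Soylent Green                           Tobacco, Beverages
3        Initech                 Pharmaceuticals, Health Care
4  Resident Evil                                   Technology
5          Hooli                          Oil and Gas, Energy
6          Weeds                    Technology, Food Products
7         Fringe                                Manufacturing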

To split the industry_keywords column, I use df1['industry_keywords'].str.split(',', expand=True)

Output:

                        0                      1
0               Chemicals   Electrical Equipment
1  Information Technology    Software & Services
2                 Tobacco              Beverages
3         Pharmaceuticals            Health Care
4              Technology                   None
5             Oil and Gas                 Energy
6              Technology          Food Products
7           Manufacturing                   None
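
Splitting on a bare ',' keeps the space that follows each comma, so the second column above actually holds values such as ' Electrical Equipment'. A minimal sketch of one way to avoid that, assuming a two-row stand-in for df1 (the `sample` frame is illustrative, not the full data): split on a regular expression that also consumes the whitespace around the comma.

```python
import pandas as pd

# Illustrative stand-in for df1 (two rows taken from the question's data).
sample = pd.DataFrame({
    "company_name": ["Looney Tunes", "Resident Evil"],
    "industry_keywords": ["Chemicals, Electrical Equipment", "Technology"],
})

# Splitting on "," alone leaves the leading space in place.
raw = sample["industry_keywords"].str.split(",", expand=True)

# A multi-character pattern is treated as a regular expression, so this
# removes the comma together with any whitespace around it.
clean = sample["industry_keywords"].str.split(r"\s*,\s*", expand=True)

print(repr(raw.loc[0, 1]))    # ' Electrical Equipment'
print(repr(clean.loc[0, 1]))  # 'Electrical Equipment'
```

Splitting on ', ' (comma plus one space) also works when the separator is always exactly that.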

df2:

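[Table not preserved. Judging by the code in the answer, df2 has at least two columns: industry_keywords (comma-separated keywords) and label.]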

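I want to map label from df2 to df1 when a keyword from industry_keywords in df1, split by commas, is contained in industry_keywords in df2, which is also split by commas.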

The expected result is as follows:

    company_name  ...                       label
0   Looney Tunes  ...  second tiers, second tiers
1   The Simpsons  ...                 third tiers
2  Soylent Green  ...                 first tiers
3        Initech  ...                         NaN
4  Resident Evil  ...                 third tiers
5          Hooli  ...                second tiers
6          Weeds  ...    third tiers, first tiers
7         Fringe  ...                second tiers

How can I do this? Maybe I should create a dictionary for df2? Thanks.
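
The dictionary idea raised in the question can be sketched as follows. Because the df2 table is not shown here, the `df2` below is a hypothetical stand-in with industry_keywords and label columns. `keyword_to_label` and `labels_for` are illustrative names. Each keyword is matched exactly after splitting, not as a substring.

```python
import numpy as np
import pandas as pd

# Three rows of df1 taken from the question.
df1 = pd.DataFrame({
    "company_name": ["Looney Tunes", "Initech", "Weeds"],
    "industry_keywords": [
        "Chemicals, Electrical Equipment",
        "Pharmaceuticals, Health Care",
        "Technology, Food Products",
    ],
})

# Hypothetical df2: the original table is not available.
df2 = pd.DataFrame({
    "industry_keywords": [
        "Tobacco, Food Products",
        "Chemicals, Electrical Equipment",
        "Technology, Software & Services",
    ],
    "label": ["first tiers", "second tiers", "third tiers"],
})

# One dictionary entry per individual keyword in df2.
keyword_to_label = {
    keyword: label
    for keywords, label in zip(df2["industry_keywords"], df2["label"])
    for keyword in keywords.split(", ")
}

def labels_for(keywords):
    """Join the labels of all keywords found in the dictionary, or NaN if none match."""
    found = [keyword_to_label[k] for k in keywords.split(", ") if k in keyword_to_label]
    return ", ".join(found) if found else np.nan

df1["label"] = df1["industry_keywords"].apply(labels_for)
print(df1[["company_name", "label"]])
```

A plain dictionary keeps only one label per keyword, so this sketch assumes each keyword appears under a single label in df2.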


1 answer
User
#1 · Posted 2024-09-26 18:02:24

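First use Series.str.split together with DataFrame.explode on both frames, so that each keyword gets its own row. Then convert the index of df1 to a column with reset_index before merging, so the original row numbers are not lost. Finally, aggregate the matched labels per row with ', '.join and add the result to the original df1 with DataFrame.join: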

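# One row per keyword: split on ', ' and explode (the original index is repeated)
df11 = df1.assign(industry_keywords=df1['industry_keywords'].str.split(', ')).explode('industry_keywords')
df22 = df2.assign(industry_keywords=df2['industry_keywords'].str.split(', ')).explode('industry_keywords')

# Keep the original row number as column 'index', match keywords, then join the labels per row
s = (df11.reset_index()
         .merge(df22, on='industry_keywords')
         .groupby('index')['label']
         .agg(', '.join))

# Rows with no matching keyword get NaN
df1 = df1.join(s)
print(df1)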

   company_name                             industry_keywords  \
0   Looney Tunes              Chemicals, Electrical Equipment   
1   The Simpsons  Information Technology, Software & Services   
2  Soylent Green                           Tobacco, Beverages   
3        Initech                 Pharmaceuticals, Health Care   
4  Resident Evil                                   Technology   
5          Hooli                          Oil and Gas, Energy   
6          Weeds                    Technology, Food Products   
7         Fringe                                Manufacturing   

                        label  
0  second tiers, second tiers  
1                 third tiers  
2                 first tiers  
3                         NaN  
4                 third tiers  
5                second tiers  
6    third tiers, first tiers  
7                second tiers  
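
To try the steps above end to end, here is a self-contained sketch. df1 is copied from the printed output. df2 is a hypothetical table, chosen only so that its keywords line up with the labels shown above. It is not the original data. `attach_labels` is an illustrative helper name.

```python
import pandas as pd

df1 = pd.DataFrame({
    "company_name": [
        "Looney Tunes", "The Simpsons", "Soylent Green", "Initech",
        "Resident Evil", "Hooli", "Weeds", "Fringe",
    ],
    "industry_keywords": [
        "Chemicals, Electrical Equipment",
        "Information Technology, Software & Services",
        "Tobacco, Beverages",
        "Pharmaceuticals, Health Care",
        "Technology",
        "Oil and Gas, Energy",
        "Technology, Food Products",
        "Manufacturing",
    ],
})

# Hypothetical df2 (assumption): the original table is not available.
df2 = pd.DataFrame({
    "industry_keywords": [
        "Tobacco, Food Products",
        "Chemicals, Electrical Equipment, Oil and Gas, Manufacturing",
        "Technology, Software & Services",
    ],
    "label": ["first tiers", "second tiers", "third tiers"],
})

def attach_labels(left, right, key="industry_keywords", sep=", "):
    """Return a copy of `left` with a label column built from exact keyword matches."""
    # One row per keyword; explode repeats the original index for each keyword.
    left_long = left.assign(**{key: left[key].str.split(sep)}).explode(key)
    right_long = right.assign(**{key: right[key].str.split(sep)}).explode(key)
    labels = (
        left_long.reset_index()       # original row number becomes column "index"
        .merge(right_long, on=key)    # inner join on the individual keyword
        .groupby("index")["label"]
        .agg(", ".join)
    )
    # Rows without any match are absent from `labels` and get NaN here.
    return left.join(labels)

result = attach_labels(df1, df2)
print(result[["company_name", "label"]])
```

Because the match is exact, 'Information Technology' in df1 does not pick up the label attached to 'Technology' in this hypothetical df2.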

To remove duplicate labels, use DataFrame.drop_duplicates:

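# Run this on the original df1 (before the label column was added above);
# joining a second 'label' column would raise an error because the names overlap
s = (df11.reset_index()
         .merge(df22, on='industry_keywords')
         .drop_duplicates(['index','label'])   # keep each label once per original row
         .groupby('index')['label']
         .agg(', '.join))

df1 = df1.join(s)
print(df1)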
   company_name                             industry_keywords  \
0   Looney Tunes              Chemicals, Electrical Equipment   
1   The Simpsons  Information Technology, Software & Services   
2  Soylent Green                           Tobacco, Beverages   
3        Initech                 Pharmaceuticals, Health Care   
4  Resident Evil                                   Technology   
5          Hooli                          Oil and Gas, Energy   
6          Weeds                    Technology, Food Products   
7         Fringe                                Manufacturing   

                      label  
0              second tiers  
1               third tiers  
2               first tiers  
3                       NaN  
4               third tiers  
5              second tiers  
6  third tiers, first tiers  
7              second tiers  
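
An alternative sketch, not part of the answer above, drops repeated labels inside the aggregation instead of calling drop_duplicates before it. The `merged` frame is a small hand-written stand-in for the result of reset_index().merge(...), and `join_unique` is an illustrative name.

```python
import pandas as pd

# Stand-in for the merged frame: one row per (original row, matched keyword).
merged = pd.DataFrame({
    "index": [0, 0, 6, 6],
    "label": ["second tiers", "second tiers", "third tiers", "first tiers"],
})

def join_unique(labels):
    """Join labels in order of first appearance, skipping repeats."""
    return ", ".join(dict.fromkeys(labels))

s = merged.groupby("index")["label"].agg(join_unique)
print(s.to_dict())  # {0: 'second tiers', 6: 'third tiers, first tiers'}
```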
