Extract multiple keywords and assign them to a new column

Posted 2024-10-16 22:26:10

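Overview

I have the code below, which looks for terms in a long string in the "Jobtitle" column and assigns the matching word (in this case "engineer", "scientist" or "analyst") to a new column named "job_cat".

Problem

At the moment, each line of code overwrites the result of the line above it. Only the "analyst" line takes effect, and every other value in the "job_cat" column is "other", even for rows that should be "engineer" or "scientist".

How can I structure the code so that all 3 values end up in the new "job_cat" column?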

glassdoor['job_cat'] = np.where(glassdoor['Jobtitle'].str.contains('engineer'), 'engineer', 'other') 
glassdoor['job_cat'] = np.where(glassdoor['Jobtitle'].str.contains('scientist'), 'scientist', 'other') 
glassdoor['job_cat'] = np.where(glassdoor['Jobtitle'].str.contains('analyst'), 'analyst', 'other') 
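
To make the overwriting visible: each np.where call produces a value for every row, and each assignment replaces the whole job_cat column. The sketch below uses made-up titles (the sample data is an assumption, not the original glassdoor data) and only two of the three lines.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data standing in for the real glassdoor frame.
glassdoor = pd.DataFrame(
    {'Jobtitle': ['data engineer', 'research scientist', 'data analyst']}
)

glassdoor['job_cat'] = np.where(
    glassdoor['Jobtitle'].str.contains('engineer'), 'engineer', 'other'
)
after_first = glassdoor['job_cat'].tolist()

# This second assignment rebuilds the entire column, so 'engineer' is lost.
glassdoor['job_cat'] = np.where(
    glassdoor['Jobtitle'].str.contains('analyst'), 'analyst', 'other'
)
after_second = glassdoor['job_cat'].tolist()

print(after_first)   # ['engineer', 'other', 'other']
print(after_second)  # ['other', 'other', 'analyst']
```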


3 answers

Something like this:

import pandas as pd

df = pd.DataFrame([{'Jobtitle': 'I am scientist'},{'Jobtitle': 'I am no one'}])


def job_cat(x):
    # Return the first keyword found in the title, or 'other' if none match
    if 'engineer' in x:
        return 'engineer'
    elif 'scientist' in x:
        return 'scientist'
    elif 'analyst' in x:
        return 'analyst'
    else:
        return 'other'


df['job_cat'] = df['Jobtitle'].apply(job_cat)

print(df)

Output

         Jobtitle    job_cat
0  I am scientist  scientist
1     I am no one      other
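
If the real titles have mixed case or missing values, the same idea could be made a little more defensive. This is only a sketch: the KEYWORDS list, the lowercasing and the handling of non-string values are assumptions added here, not part of the answer above.

```python
import pandas as pd

# Assumed keyword list; the first keyword found in this order wins.
KEYWORDS = ['engineer', 'scientist', 'analyst']


def job_cat(title):
    # Non-string values (for example None or NaN) are treated as 'other'.
    if not isinstance(title, str):
        return 'other'
    lowered = title.lower()
    for keyword in KEYWORDS:
        if keyword in lowered:
            return keyword
    return 'other'


# Hypothetical sample data.
df = pd.DataFrame(
    {'Jobtitle': ['Senior Data Scientist', 'Analyst II', None, 'Manager']}
)
df['job_cat'] = df['Jobtitle'].apply(job_cat)
print(df)
```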

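I see you are using the contains function.

  • You can use the extract function instead of contains. It tests all the keywords in a single pass and returns the first one that matches in each row.
  • Rows with no match come back as NaN, so you can use fillna to insert other.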
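import pandas as pd

df = pd.DataFrame(
         ['i am a scientist', 'this is engineer', 'hey im an analyst', 'hey nothing'],
         columns=['jobtitle']
     )

# expand=False returns a Series: the first match per row, NaN if none
df['job_cat'] = df['jobtitle'].str.extract("(scientist|engineer|analyst)", expand=False)
df['job_cat'] = df['job_cat'].fillna("other")

print(df)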

Output:

            jobtitle    job_cat
0   i am a scientist  scientist
1   this is engineer   engineer
2  hey im an analyst    analyst
3        hey nothing      other
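
If the titles can contain capital letters, one possible variant is to pass flags=re.IGNORECASE to str.extract and lowercase the result. The sample titles below are assumptions made for illustration.

```python
import re

import pandas as pd

# Hypothetical sample data with mixed case.
df = pd.DataFrame({'jobtitle': ['Lead Engineer', 'Data ANALYST', 'nothing']})

pattern = r'(scientist|engineer|analyst)'
df['job_cat'] = (
    df['jobtitle']
    .str.extract(pattern, flags=re.IGNORECASE, expand=False)  # first match or NaN
    .str.lower()        # normalise 'Engineer' / 'ANALYST' to lowercase
    .fillna('other')    # rows without any keyword
)
print(df)
```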

Option 1: an elegant solution (loop over the keywords with .loc and str.contains)

import pandas as pd

data = {'jobtitle': ['job scientist', 'job is engineer', 'job analyst', 'hey nothing']}

glassdoor = pd.DataFrame(data)

# Label each row whose title contains the keyword; a later keyword overwrites an earlier one
for job_option in ['engineer', 'analyst', 'scientist']:
    glassdoor.loc[glassdoor['jobtitle'].str.contains(job_option), 'job_cat'] = job_option
# Rows that matched no keyword are still NaN; label them 'other'
glassdoor['job_cat'] = glassdoor['job_cat'].fillna("other")

# Print the output.
print(glassdoor)

Output:

          jobtitle    job_cat
0    job scientist  scientist
1  job is engineer   engineer
2      job analyst    analyst
3      hey nothing      other
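
One thing worth knowing about the loop approach: when a title contains more than one keyword, the keyword that comes last in the list wins. The sketch below illustrates that with a made-up title, and as a small variation (an assumption, not the code above) it initialises the column to 'other' first so no fillna step is needed.

```python
import pandas as pd

# Hypothetical data: the first title contains two keywords.
glassdoor = pd.DataFrame({'jobtitle': ['analyst engineer', 'hey nothing']})

# Start with the default label so unmatched rows need no later fix-up.
glassdoor['job_cat'] = 'other'

for job_option in ['engineer', 'analyst']:
    mask = glassdoor['jobtitle'].str.contains(job_option)
    glassdoor.loc[mask, 'job_cat'] = job_option  # later keywords overwrite earlier ones

print(glassdoor)
```

To change the priority, reorder the list so the preferred keyword comes last.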

Option 2: nested np.where calls

glassdoor['job_cat'] = np.where(glassdoor['Jobtitle'].str.contains('analyst'), 'analyst', np.where(glassdoor['Jobtitle'].str.contains('scientist'), 'scientist', np.where(glassdoor['Jobtitle'].str.contains('engineer'), 'engineer', 'other')))
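
A flatter alternative to nesting, offered here as a sketch and not as part of the original answer, is np.select, which takes a list of conditions and a matching list of choices and uses the first condition that is true for each row. The sample data is assumed.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data.
glassdoor = pd.DataFrame(
    {'Jobtitle': ['data analyst', 'research scientist', 'software engineer', 'manager']}
)

titles = glassdoor['Jobtitle']
conditions = [
    titles.str.contains('analyst'),
    titles.str.contains('scientist'),
    titles.str.contains('engineer'),
]
choices = ['analyst', 'scientist', 'engineer']

# The first true condition decides the label; 'other' is used when none match.
glassdoor['job_cat'] = np.select(conditions, choices, default='other')
print(glassdoor)
```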
