Check whether multiple keywords are present and create another column using Python

import pandas as pd

df = pd.DataFrame({
    'meds': ['Calcium Acetate',
             'insulin GLARGINE -- LANTUS - inJECTable',
             'amoxicillin 1 g + clavulanic acid 200 mg ',
             'digoxin - TABLET'],
    'details': ['DOSE: 667 mg - TDS with food - Inject',
                'DOSE: 12 unit(s) - ON - SC (SubCutaneous)',
                '-- AUGMENTIN - inJECTable',
                'DOSE: 62.5 mcg - Every other morning - PO'],
    'extracted': ['Calcium Acetate 667 mg Inject',
                  'insulin GLARGINE -- LANTUS 12 unit(s) - SC (SubCutaneous)',
                  'amoxicillin 1 g + clavulanic acid 200 mg -- AUGMENTIN',
                  'digoxin - TABLET 62.5 mcg PO/Tube'],
})
df['concatenated'] = df['meds'] + " " + df['details']

df['clean_extract'] = df.extracted.str.extract(r'([a-zA-Z0-9\s]+)')
# The above regex is incorrect. I would like to clean the text
# (remove all symbols except spaces and retain a clean text).
df['keywords'] = df.clean_extract.str.split(' ')  # split them into keywords

def value_present(row):
    # check whether each of the keywords is present in the `concatenated` column
    if isinstance(row['keywords'], list):
        for keyword in row['keywords']:
            return 1
    else:
        return 0

df['output'] = df[df.apply(value_present, axis=1)][['concatenated', 'keywords']].head()

1 answer

User

#1 · Posted on 2024-09-29 21:44:41

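Let's zip the columns extracted and concatenated and pass each pair to a function f, which computes the set difference of their whitespace-separated tokens. If every token of extracted also appears in concatenated, f returns 1 and NaN; otherwise it returns 0 and the missing tokens: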

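import numpy as np

def f(x, y):
    # tokens of x (extracted) that do not occur in y (concatenated)
    s = set(x.split()) - set(y.split())
    # sorted() keeps the joined string stable, since set order is arbitrary
    return [0, ', '.join(sorted(s))] if s else [1, np.nan]

df[['output', 'issue']] = [f(*s) for s in zip(df['extracted'], df['concatenated'])]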

   output    issue
0       1      NaN
1       1      NaN
2       1      NaN
3       0  PO/Tube
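
The question also asks to strip symbols before comparing, which the set difference above does not do. A minimal sketch of one way to combine the two steps follows. The helper names `clean_tokens` and `missing_keywords` are hypothetical, and it assumes that "clean" means lowercasing and treating every character other than letters, digits and whitespace as a separator, so `PO/Tube` is split into `po` and `tube`, and `62.5` into `62` and `5`.

```python
import re

import pandas as pd


def clean_tokens(text):
    # Assumption: lowercase, then turn every non-alphanumeric,
    # non-whitespace character into a space before splitting.
    return re.sub(r"[^a-z0-9\s]", " ", text.lower()).split()


def missing_keywords(extracted, concatenated):
    # Cleaned tokens of `extracted` that are absent from `concatenated`,
    # in order of first appearance and without duplicates.
    available = set(clean_tokens(concatenated))
    missing = []
    for token in clean_tokens(extracted):
        if token not in available and token not in missing:
            missing.append(token)
    return missing


# Two rows taken from the question's data.
sample = pd.DataFrame({
    "extracted": [
        "Calcium Acetate 667 mg Inject",
        "digoxin - TABLET 62.5 mcg PO/Tube",
    ],
    "concatenated": [
        "Calcium Acetate DOSE: 667 mg - TDS with food - Inject",
        "digoxin - TABLET DOSE: 62.5 mcg - Every other morning - PO",
    ],
})

missing = [
    missing_keywords(e, c)
    for e, c in zip(sample["extracted"], sample["concatenated"])
]
sample["output"] = [0 if m else 1 for m in missing]
sample["issue"] = [", ".join(m) if m else None for m in missing]
print(sample[["output", "issue"]])
```

With this cleaning, the last row is flagged for `tube` only, because `po` now matches the `PO` in concatenated. Whether that is the desired behaviour depends on how strictly the keywords should match.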
