Check whether multiple keywords are present and create another column using Python

Posted 2024-09-29 21:44:41


I have a dataframe like the one shown below:

import pandas as pd

df = pd.DataFrame({'meds': ['Calcium Acetate','insulin GLARGINE -- LANTUS -  inJECTable','amoxicillin  1 g  + clavulanic acid  200 mg ','digoxin  - TABLET'],
                   'details':['DOSE: 667 mg - TDS with food - Inject','DOSE:   12 unit(s)  -  ON  -  SC (SubCutaneous)','-- AUGMENTIN -  inJECTable','DOSE:   62.5 mcg  -  Every other morning  -  PO'],
                   'extracted':['Calcium Acetate 667 mg Inject','insulin GLARGINE -- LANTUS 12 unit(s) -  SC (SubCutaneous)','amoxicillin  1 g  + clavulanic acid  200 mg -- AUGMENTIN','digoxin  - TABLET 62.5 mcg PO/Tube']})
df['concatenated'] = df['meds'] + " "+ df['details']

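What I would like to do is:

a) Check whether every individual keyword in the extracted column is present in the concatenated column

b) If all of them are present, assign 1 to the output column, otherwise 0

c) Store the keywords that were not found in the issue column, as shown below

So I tried the following: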

df['clean_extract'] = df.extracted.str.extract(r'([a-zA-Z0-9\s]+)')
# The regex above is incorrect. I would like to clean the text (remove all symbols except spaces and keep plain text).
df['keywords'] = df.clean_extract.str.split(' ')  # split into keywords
def value_present(row):  # check whether each keyword is present in the `concatenated` column
    if isinstance(row['keywords'], list):
        for keyword in row['keywords']:
            return 1
    else:
        return 0

df['output'] = df[df.apply(value_present, axis=1)][['concatenated', 'keywords']].head()

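If you think cleaning the concatenated column would also help, that is fine. I am only interested in finding out whether all the keywords are present.

Is there an efficient and elegant way to do this on 7-8 million records?

I expect my output to look like the one below. Red marks the terms in extracted that are missing from concatenated. Those rows are therefore assigned 0, and the missing keywords are stored in the issue column.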

[Image: expected output table (not preserved)]


1 answer
User
#1 · Posted 2024-09-29 21:44:41

Let's zip the extracted and concatenated columns and pass each pair to a function f, which computes the set difference and returns the result accordingly:

import numpy as np

def f(x, y):
    # Keywords in `extracted` (x) that do not appear in `concatenated` (y)
    s = set(x.split()) - set(y.split())
    return [0, ', '.join(s)] if s else [1, np.nan]

df[['output', 'issue']] = [f(*s) for s in zip(df['extracted'], df['concatenated'])]

   output    issue
0       1      NaN
1       1      NaN
2       1      NaN
3       0  PO/Tube
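
The question also asks about stripping symbols before comparing. Here is a minimal sketch of one way to combine that with the set-difference idea above. It assumes that "symbol" means any character other than an ASCII letter, digit, or whitespace; the names SYMBOLS, tokens, and missing_keywords are made up for illustration and are not part of the original answer.

```python
import re

import numpy as np
import pandas as pd

# Same sample data as in the question.
df = pd.DataFrame({
    'meds': ['Calcium Acetate',
             'insulin GLARGINE -- LANTUS -  inJECTable',
             'amoxicillin  1 g  + clavulanic acid  200 mg ',
             'digoxin  - TABLET'],
    'details': ['DOSE: 667 mg - TDS with food - Inject',
                'DOSE:   12 unit(s)  -  ON  -  SC (SubCutaneous)',
                '-- AUGMENTIN -  inJECTable',
                'DOSE:   62.5 mcg  -  Every other morning  -  PO'],
    'extracted': ['Calcium Acetate 667 mg Inject',
                  'insulin GLARGINE -- LANTUS 12 unit(s) -  SC (SubCutaneous)',
                  'amoxicillin  1 g  + clavulanic acid  200 mg -- AUGMENTIN',
                  'digoxin  - TABLET 62.5 mcg PO/Tube'],
})
df['concatenated'] = df['meds'] + " " + df['details']

# Assumption: anything that is not an ASCII letter, digit, or whitespace
# is a "symbol" and is replaced by a space before splitting.
SYMBOLS = re.compile(r'[^A-Za-z0-9\s]+')


def tokens(text):
    """Return the set of cleaned, whitespace-separated keywords in text."""
    return set(SYMBOLS.sub(' ', text).split())


def missing_keywords(extracted, concatenated):
    """Return (output, issue) for one row, mirroring f in the answer."""
    missing = tokens(extracted) - tokens(concatenated)
    # sorted() keeps the order of the reported keywords deterministic.
    return (0, ', '.join(sorted(missing))) if missing else (1, np.nan)


results = [missing_keywords(e, c)
           for e, c in zip(df['extracted'], df['concatenated'])]
df['output'] = [r[0] for r in results]
df['issue'] = [r[1] for r in results]
```

Because the slash is treated as a separator here, the last row reports only Tube as missing instead of PO/Tube, and values such as 62.5 are compared as the separate tokens 62 and 5. Whether that is desirable depends on the data. The comparison is still case-sensitive, as in the answer above.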
