I have a dataframe like the one shown below:
import pandas as pd

df = pd.DataFrame({
    'meds': ['Calcium Acetate', 'insulin GLARGINE -- LANTUS - inJECTable', 'amoxicillin 1 g + clavulanic acid 200 mg ', 'digoxin - TABLET'],
    'details': ['DOSE: 667 mg - TDS with food - Inject', 'DOSE: 12 unit(s) - ON - SC (SubCutaneous)', '-- AUGMENTIN - inJECTable', 'DOSE: 62.5 mcg - Every other morning - PO'],
    'extracted': ['Calcium Acetate 667 mg Inject', 'insulin GLARGINE -- LANTUS 12 unit(s) - SC (SubCutaneous)', 'amoxicillin 1 g + clavulanic acid 200 mg -- AUGMENTIN', 'digoxin - TABLET 62.5 mcg PO/Tube'],
})
df['concatenated'] = df['meds'] + " " + df['details']
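What I would like to do is:
a) check whether every individual keyword in the extracted column is present in the concatenated column
b) if all are present, assign 1 to the output column, else 0
c) store the keywords that were not found in the issue column, as shown below
So, I tried the following: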
df['clean_extract'] = df.extracted.str.extract(r'([a-zA-Z0-9\s]+)')
# the above regex is incorrect. I would like to clean the text (remove all symbols except spaces and retain a clean text)
df['keywords'] = df.clean_extract.str.split(' ')  # split them into keywords

def value_present(row):  # check whether each of the keyword is present in `concatenated` column
    if isinstance(row['keywords'], list):
        for keyword in row['keywords']:
            return 1
    else:
        return 0

df['output'] = df[df.apply(value_present, axis=1)][['concatenated', 'keywords']].head()
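If you think it is also useful to clean the concatenated column, that is fine. I am only interested in checking that all the keywords are present.
Is there an efficient and elegant way to do this on 7-8 million records?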
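I expect my output to look like the one shown below. Red marks the terms that are missing between the extracted and concatenated columns. Such rows are therefore assigned 0, and the missing keywords are stored in the issue column.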
Let us zip the columns extracted and concatenated, and map each pair to a function f that computes the set difference and returns the result accordingly.
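A minimal sketch of this approach might look like the following. It assumes that a "keyword" is any whitespace-separated token, compared case-sensitively and with symbols left in place; the body of `f` and the helper name `results` are illustrative choices rather than a definitive implementation.

```python
import pandas as pd

df = pd.DataFrame({
    'meds': ['Calcium Acetate', 'insulin GLARGINE -- LANTUS - inJECTable',
             'amoxicillin 1 g + clavulanic acid 200 mg ', 'digoxin - TABLET'],
    'details': ['DOSE: 667 mg - TDS with food - Inject',
                'DOSE: 12 unit(s) - ON - SC (SubCutaneous)',
                '-- AUGMENTIN - inJECTable',
                'DOSE: 62.5 mcg - Every other morning - PO'],
    'extracted': ['Calcium Acetate 667 mg Inject',
                  'insulin GLARGINE -- LANTUS 12 unit(s) - SC (SubCutaneous)',
                  'amoxicillin 1 g + clavulanic acid 200 mg -- AUGMENTIN',
                  'digoxin - TABLET 62.5 mcg PO/Tube'],
})
df['concatenated'] = df['meds'] + " " + df['details']


def f(extracted, concatenated):
    """Return (output, issue) for one pair of strings."""
    # Assumption: keywords are whitespace-separated tokens.
    available = set(concatenated.split())
    # Set difference, but keeping the order in which keywords appear
    # in `extracted` and dropping duplicates.
    missing = [w for w in dict.fromkeys(extracted.split()) if w not in available]
    if missing:
        return 0, ', '.join(missing)
    return 1, None


# Zip the two columns and map each pair to f.
results = [f(e, c) for e, c in zip(df['extracted'], df['concatenated'])]
df['output'] = [r[0] for r in results]
df['issue'] = [r[1] for r in results]
```

On the sample data this marks the first three rows with 1 and the last row with 0, with `PO/Tube` stored in issue. If symbols should be stripped first, both columns could be cleaned beforehand with something like `df['extracted'].str.replace(r'[^a-zA-Z0-9\s]', ' ', regex=True)`; note that this also splits tokens such as `62.5` into `62` and `5`. Iterating over `zip` in plain Python usually avoids much of the per-row overhead of `DataFrame.apply(..., axis=1)`, which matters at several million records.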