I'm using a Jupyter notebook (Python 3). I'm trying to extract the keywords in my list from a pandas DataFrame. The list will contain about 50 keywords.
For example:
import pandas as pd
import re
rgx_words1 = ['algaecid','algaecide','algaecides','anti-bakterien']
pattern = "\\b("+'|'.join(rgx_words1)+")\\b"
re_patt = re.compile(pattern)
pattern2 = "("+'|'.join(rgx_words1)+")"
re_patt2 = re.compile(pattern2)
data = [[1, 'I, will, find, algaecide, dd, algaecid, algaecides'], [2, 'fff, algaecid, dd, algaecide'], [3, 'ssssalgaecidllll, algaecides']]
# Create the pandas DataFrame
mydf = pd.DataFrame(data, columns = ['id', 'text'])
mydf['matches'] = mydf.apply(lambda x : re.findall(re_patt,x['text']),axis=1)
mydf['matches2'] = mydf.apply(lambda x : re.findall(re_patt2,x['text']),axis=1)
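With re_patt I extract exact words and get the correct result: for id 1 the output is algaecide, algaecid, algaecides. With re_patt2 I want to match every occurrence of a keyword, including ones embedded in a longer string such as "ssssalgaecidllll", where the output should be "algaecid". The output of re_patt2 for id 1 is algaecid, algaecid, algaecid, but the output I want is algaecide, algaecid, algaecides. Any advice would be much appreciated. Thanks in advance.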
You can change pattern2 so that it optionally matches non-whitespace characters other than commas on the left and right of the keyword, using [^\s,]*.
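A minimal sketch of that change is below. The names rgx_words1 and re_patt2 come from the question. The non-capturing group around the keywords, the re.escape call, the sample list texts, and the helpers find_tokens and find_keywords are assumptions added for illustration, and may differ from the answer's own code. The second pattern is an alternative that is not part of the answer: it sorts the keywords longest first, because Python's alternation takes the first alternative that matches, which is why re_patt2 in the question returns algaecid three times.

```python
import re

rgx_words1 = ['algaecid', 'algaecide', 'algaecides', 'anti-bakterien']

# Escape each keyword so characters such as '-' are always treated literally.
keywords = '|'.join(re.escape(w) for w in rgx_words1)

# Suggested change: optional non-whitespace, non-comma characters on both sides.
# The group is non-capturing so findall returns the whole match.
re_patt2 = re.compile(r"[^\s,]*(?:" + keywords + r")[^\s,]*")

# Alternative (not from the answer): try longer keywords first so that
# 'algaecides' is preferred over 'algaecid' at the same position.
longest_first = sorted(rgx_words1, key=len, reverse=True)
re_keyword = re.compile('|'.join(re.escape(w) for w in longest_first))


def find_tokens(text):
    """Return every comma-separated token that contains a keyword."""
    return re_patt2.findall(text)


def find_keywords(text):
    """Return the longest keyword found at each match position."""
    return re_keyword.findall(text)


texts = [
    'I, will, find, algaecide, dd, algaecid, algaecides',
    'fff, algaecid, dd, algaecide',
    'ssssalgaecidllll, algaecides',
]

for text in texts:
    print(find_tokens(text), find_keywords(text))
# ['algaecide', 'algaecid', 'algaecides'] ['algaecide', 'algaecid', 'algaecides']
# ['algaecid', 'algaecide'] ['algaecid', 'algaecide']
# ['ssssalgaecidllll', 'algaecides'] ['algaecid', 'algaecides']
```

With the question's DataFrame, either helper could be applied to the column directly, for example mydf['matches2'] = mydf['text'].apply(find_tokens). Note that the [^\s,]* version returns the whole token (ssssalgaecidllll), while the longest-first version returns only the keyword (algaecid).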