Searching a column for a list of strings

Posted 2024-06-26 09:52:10

I have a list of strings that I want to search for.

strings = ['Tea','Baseball','Onus']

My DataFrame is:

   itemid   desc
0  101      tea leaves
1  201      baseball gloves
2  221      teas leaves from Onus Green Tea Co.

I want to get something like this, with partial matches excluded:

   itemid   desc                                 matches
0  101      tea leaves                           [Tea]
1  201      baseball gloves                      [Baseball]
2  221      teas leaves from Onus Green Tea Co.  [Tea, Onus]

This is what I am doing:

import re
df['desc'] = df.desc.str.split(' ')
df['desc'].str.findall('|'.join(strings),flags=re.IGNORECASE)

But it gives me lists of empty, comma-separated tuples:

0     [(, , , , , ), (, , , , , ), (, , , , , )]
1     [(, , , , , ), (, , , , , ), (, , , , , )]
2     [(, , , , , ), (, , , , , ), (, , , , , )]

Please help me fix this.

Edit: I don't want partial matches. The updated example reflects this.


3 Answers

Try using str.contains with a regex alternation:

strings = ['Tea','Baseball','Onus']
rgx = '\\b(?:' + '|'.join(strings) + ')\\b'
df[df.desc.str.contains(rgx, case=False, regex=True, na=False)]
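Note that str.contains returns a boolean mask, so this approach filters rows rather than listing which strings matched. A minimal sketch of that behaviour, assuming the question's data plus one hypothetical extra row ('teas only') that should not match. The re.escape call and case=False are precautions added here, not necessarily part of the answer as posted:

```python
import re
import pandas as pd

strings = ['Tea', 'Baseball', 'Onus']
df = pd.DataFrame({
    'itemid': [101, 201, 221, 301],
    'desc': [
        'tea leaves',
        'baseball gloves',
        'teas leaves from Onus Green Tea Co.',
        'teas only',  # hypothetical row: only a partial match
    ],
})

# Escape each term, then wrap the alternation in word boundaries
rgx = r'\b(?:' + '|'.join(map(re.escape, strings)) + r')\b'

# Boolean mask: True where at least one whole-word term occurs
mask = df['desc'].str.contains(rgx, case=False, regex=True, na=False)
filtered = df[mask]
print(filtered)
```

The non-capturing group (?:...) keeps the alternation grouped without creating a capture group.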

We can use Series.str.findall with the inline regex ignore-case flag (?i), so there is no need to import re:

df['Matches'] = df['desc'].str.findall(f'(?i)({"|".join(strings)})')

   itemid                                desc           Matches
0     101                          tea leaves             [tea]
1     201                     baseball gloves        [baseball]
2     221  tea leaves from Onus Green Tea Co.  [tea, Onus, Tea]

To remove duplicates, we convert the strings to uppercase and build a set (note that a set does not preserve order):

df['Matches'] = (
    df['desc'].str.findall(f'(?i)({"|".join(strings)})')
    .apply(lambda x: list(set(map(str.upper, x))))
)
   itemid                                desc      Matches
0     101                          tea leaves        [TEA]
1     201                     baseball gloves   [BASEBALL]
2     221  tea leaves from Onus Green Tea Co.  [TEA, ONUS]
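If you would rather keep the original casing from strings and the order of first appearance, one possible variation is to map each match back to its entry in the list and deduplicate with dict.fromkeys. This is only a sketch; canonical and unique_matches are hypothetical names introduced here, and re.escape is an added precaution:

```python
import re
import pandas as pd

strings = ['Tea', 'Baseball', 'Onus']
df = pd.DataFrame({
    'desc': ['tea leaves', 'baseball gloves', 'tea leaves from Onus Green Tea Co.'],
})

# Lookup from lowercase match to the spelling used in `strings`
canonical = {s.lower(): s for s in strings}
pattern = '(?i)' + '|'.join(map(re.escape, strings))

def unique_matches(found):
    # dict.fromkeys drops duplicates while keeping first-seen order
    return list(dict.fromkeys(canonical[m.lower()] for m in found))

df['Matches'] = df['desc'].str.findall(pattern).apply(unique_matches)
print(df)
```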

Edit: excluding partial matches

We can use the word boundary \b:

strings = ['\\b' + f + '\\b' for f in strings]

df['Matches'] = df['desc'].str.findall(f'(?i)({"|".join(strings)})')
   itemid                                 desc      Matches
0     101                           tea leaves        [tea]
1     201                      baseball gloves   [baseball]
2     221  teas leaves from Onus Green Tea Co.  [Onus, Tea]
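To see what \b changes, here is a small sketch using the standard re module directly on the third description from the question. The variable names are illustrative only:

```python
import re

text = 'teas leaves from Onus Green Tea Co.'
terms = ['Tea', 'Baseball', 'Onus']

# Without boundaries, 'tea' is found inside 'teas'
loose = re.findall('|'.join(terms), text, flags=re.IGNORECASE)

# With \b on both sides, only whole words match
strict = re.findall(r'\b(?:' + '|'.join(terms) + r')\b', text, flags=re.IGNORECASE)

print(loose)
print(strict)
```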

You don't need to split the desc column.

import re
import pandas as pd

strings = ['Tea','Baseball','Onus']
df = pd.DataFrame({"desc": ['tea leaves', 'baseball gloves', 'tea leaves from Onus Green Tea Co.']})
# Call findall on the original, unsplit string column
df['matches'] = df['desc'].str.findall('|'.join(strings),flags=re.IGNORECASE)
print(df['matches'])

Output:

0               [tea]
1          [baseball]
2    [tea, Onus, Tea]
Name: matches, dtype: object
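The reason splitting gets in the way is that after str.split each cell holds a sequence of words rather than a single string, while findall searches strings. A small sketch, assuming the first two sample descriptions (variable names are illustrative):

```python
import re
import pandas as pd

strings = ['Tea', 'Baseball', 'Onus']
desc = pd.Series(['tea leaves', 'baseball gloves'])

# After splitting, each cell is a sequence of words, not one string
split_desc = desc.str.split(' ')
first_cell = list(split_desc.iloc[0])
print(first_cell)

# On the unsplit column, findall sees whole strings and can match
found = desc.str.findall('|'.join(strings), flags=re.IGNORECASE)
print(found)
```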
