Count a word, but ignore it when the preceding word is capitalized

Posted 2024-09-30 06:11:15


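I want to determine whether a cell contains the word "McDonald". However, I want to ignore cases where the word before "McDonald" starts with a capital letter, as in "Kevin McDonald". Any suggestions on how to do this with a regular expression in a DataFrame?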

import pandas as pd

data = {'text': ["Kevin McDonald has bought a burger.",
                 "The best burger in McDonald is cheeze buger."]}

df = pd.DataFrame(data)
long_list = ['McDonald', 'Five Guys']

# match any of the terms; the group keeps \b applied to every alternative
pattern = r'\b(?:{})\b'.format('|'.join(long_list))

df['count'] = df.text.str.count(pattern)
                                           text
0           Kevin McDonald has bought a burger.
1  The best burger in McDonald is cheeze buger.

Expected output:

                                           text  count
0           Kevin McDonald has bought a burger.      0
1  The best burger in McDonald is cheeze buger.      1

2 answers

You can try the following pattern:

# a word starting with a lowercase letter, a space, then any of the terms
pattern = r'\b[a-z]\w*\b (?:{})'.format('|'.join(long_list))

df['count'] = df.text.str.count(pattern)

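IIUC, the goal is to not match when the preceding word is capitalized. Requiring a non-capitalized word before the term rules out many legitimate cases.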
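To make the missed cases concrete, here is a minimal sketch using only the standard `re` module. It assumes a simplified "lowercase word, space, term" pattern of the same shape as the first answer's; the `(?:...)` grouping and the last two sentences are additions for illustration. Matches are counted with `len(re.findall(...))`, which should correspond to what `Series.str.count` reports per element.

```python
import re

long_list = ['McDonald', 'Five Guys']

# Assumed simplification: a lowercase-initial word, one space, then a term.
pattern = r'\b[a-z]\w*\b (?:{})'.format('|'.join(long_list))

texts = [
    "Kevin McDonald has bought a burger.",           # preceded by a capitalized word
    "The best burger in McDonald is cheeze buger.",  # preceded by a lowercase word
    "McDonald's restaurants.",                       # term at the start of the string
    "Blah. McDonald's restaurants.",                 # term right after punctuation
]

# Count non-overlapping matches in each sentence.
counts = [len(re.findall(pattern, t)) for t in texts]
print(counts)  # [0, 1, 0, 0]
```

The last two sentences contain a legitimate "McDonald" but are not counted, because no lowercase word sits directly in front of it.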

Here is a regex that allows more cases (start of the sentence, or after a non-word character):

regex = '|'.join(fr'(?:\b[^A-Z]\S*\s+|[^\w\s] ?|^){i}' for i in long_list)
df['count'] = df['text'].str.count(regex)

For example:

                                           text  count
0           Kevin McDonald has bought a burger.      0
1  The best burger in McDonald is cheeze buger.      1
2                       McDonald's restaurants.      1
3                 Blah. McDonald's restaurants.      1
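For anyone who wants to reproduce the table, a self-contained sketch follows. It rebuilds the four-row frame and applies the same alternation; wrapping each term in `re.escape` is an extra precaution added here (not part of the regex above) in case a term ever contains regex metacharacters.

```python
import re

import pandas as pd

df = pd.DataFrame({'text': [
    "Kevin McDonald has bought a burger.",
    "The best burger in McDonald is cheeze buger.",
    "McDonald's restaurants.",
    "Blah. McDonald's restaurants.",
]})
long_list = ['McDonald', 'Five Guys']

# For each term, accept three kinds of prefix:
#   a preceding word that does not start with A-Z, then whitespace;
#   a punctuation character with an optional space;
#   or the start of the string.
regex = '|'.join(
    fr'(?:\b[^A-Z]\S*\s+|[^\w\s] ?|^){re.escape(term)}'
    for term in long_list
)

df['count'] = df['text'].str.count(regex)
print(df['count'].tolist())  # [0, 1, 1, 1]
```

Note that `^` only anchors at the start of each string here, since no `re.MULTILINE` flag is passed.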

You can test and explore the regex here
