Extracting words that begin with a capital letter

User

Answer 1 · edited 2024-09-29 17:24:13

Here is an option using re.findall:

import re

text1 = "sedentary. Allan Takocok. That's the conclusion of two studies published in this week's issue of The New England Journal of Medicine."
matches = re.findall(r'(?:(?<=^)|(?<=[^.]))\s+([A-Z][a-z]+)', text1)
print(matches)

This prints:

['Takocok', 'The', 'New', 'England', 'Journal', 'Medicine']

Here is an explanation of the regex pattern:

(?:(?<=^)|(?<=[^.]))   assert that what precedes is either the start of the string,
                       or a non full stop character
\s+                    then match (but do not capture) one or more spaces
([A-Z][a-z]+)          then match AND capture a word starting with a capital letter
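
The same explanation can be kept next to the pattern itself. The following is only a sketch, assuming Python's re.VERBOSE flag is acceptable in your code base; the variable name `pattern` is made up for illustration.

```python
import re

# Same pattern as above, written in verbose mode so that each part
# carries its own comment. Whitespace inside the pattern is ignored.
pattern = re.compile(r"""
    (?:(?<=^)|(?<=[^.]))   # preceded by the start of the string or a non-full-stop character
    \s+                    # one or more whitespace characters (matched, not captured)
    ([A-Z][a-z]+)          # capture a word starting with a capital letter
""", re.VERBOSE)

text1 = "sedentary. Allan Takocok. That's the conclusion of two studies published in this week's issue of The New England Journal of Medicine."
print(pattern.findall(text1))
# ['Takocok', 'The', 'New', 'England', 'Journal', 'Medicine']
```

Note that [a-z]+ stops at the first uppercase letter after the initial one, so a mixed-case name would only be captured up to that point.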

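User

Answer 2 · edited 2024-09-29 17:24:13

In this case it may be possible to find a regular expression, but it tends to get messy.

Instead, I suggest doing it in two steps:

1. Split the text into tokens
2. Use these tokens to extract the interesting words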

tokens = [
    'sedentary',
    '.',
    ' ',
    'Allan',
    ' ',
    'Takocok',
    '.',
    ' ',
    'That\'s',
    …
]

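Splitting the text into tokens is already complicated enough.

With this list of tokens it is easier to express the actual requirements, because you are now working with well-defined tokens rather than arbitrary character sequences.

I kept the whitespace in the token list because you may want to distinguish the dots in 'a.brand.name' or 'www.example.org' from the dot at the end of a sentence.

With this list of tokens it is also easier than before to express rules such as "must be preceded by a dot".

I expect your rules to become quite complicated over time, since you are dealing with natural language text. Hence the abstraction to tokens.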
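
The two steps could be sketched roughly as follows. This is a minimal illustration, not a complete tokenizer: the token pattern `TOKEN_RE` and the helper names `tokenize` and `capitalized_words` are assumptions made up for this example, and the only rule implemented is "skip a capitalized word at the start of the text or right after a dot followed by whitespace".

```python
import re

# Hypothetical token pattern: a word (letters and apostrophes),
# a run of whitespace, or any other single character such as '.'.
TOKEN_RE = re.compile(r"[A-Za-z']+|\s+|[^A-Za-z'\s]")


def tokenize(text):
    """Step 1: split the text into tokens, keeping whitespace tokens."""
    return TOKEN_RE.findall(text)


def capitalized_words(tokens):
    """Step 2: collect capitalized words that do not start a sentence."""
    result = []
    for i, token in enumerate(tokens):
        if not token[0].isupper():
            continue
        at_text_start = i == 0
        # A sentence start is a dot, then whitespace, then the word.
        after_sentence_end = (
            i >= 2 and tokens[i - 1].isspace() and tokens[i - 2] == "."
        )
        if not (at_text_start or after_sentence_end):
            result.append(token)
    return result


text1 = "sedentary. Allan Takocok. That's the conclusion of two studies published in this week's issue of The New England Journal of Medicine."
tokens = tokenize(text1)
print(tokens[:9])
# ['sedentary', '.', ' ', 'Allan', ' ', 'Takocok', '.', ' ', "That's"]
print(capitalized_words(tokens))
# ['Takocok', 'The', 'New', 'England', 'Journal', 'Medicine']
```

Because whitespace stays in the token list, a dot inside 'www.example.org' is not followed by a whitespace token and so is not treated as the end of a sentence. Further rules can be added as plain conditions on neighbouring tokens.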

User

Answer 3 · edited 2024-09-29 17:24:13

This should be the regex you are looking for:

(?<!\.)\s+([A-Z][A-Za-z]+)

See it on regex101: https://regex101.com/r/EoPqgw/1
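
For reference, here is a small sketch of how this pattern could be used from Python with re.findall. The sample text is the one from the first answer; the second sentence is a made-up example showing that [A-Za-z]+ also keeps mixed-case names whole.

```python
import re

# Negative lookbehind: the whitespace must not be preceded by a full stop.
pattern = re.compile(r'(?<!\.)\s+([A-Z][A-Za-z]+)')

text1 = "sedentary. Allan Takocok. That's the conclusion of two studies published in this week's issue of The New England Journal of Medicine."
print(pattern.findall(text1))
# ['Takocok', 'The', 'New', 'England', 'Journal', 'Medicine']

print(pattern.findall("I met McDonald today."))
# ['McDonald']
```

One caveat that applies under these assumptions: if a sentence ends with a dot followed by two or more spaces, the match can start at the second space, so the first word of the next sentence would be captured as well.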
