Python regex: find the inner content, excluding the boundaries

2 answers

User

Answer 1 · edited 2024-06-02 05:48:18

You can use a capturing group:

regex = r"([A-Z].*?)[.!?;]"

Then, if you are searching and getting a match object for each match:

sentence = match_obj.group(1)
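As a minimal sketch of that loop, assuming the match objects come from `re.finditer()` (the answer does not say which search call is used, and the sample text is made up for illustration):

```python
import re

# Hypothetical sample text, not taken from the question.
text = "Some sentence. And another; a lowercase clause!"
pattern = re.compile(r"([A-Z].*?)[.!?;]", re.DOTALL)

sentences = []
for match_obj in pattern.finditer(text):
    # group(0) is the whole match, including the trailing punctuation;
    # group(1) is only the text captured by the parentheses.
    sentences.append(match_obj.group(1))

first = pattern.search(text)
print(first.group(0))   # whole match, with the delimiter
print(first.group(1))   # captured sentence only
print(first.groups())   # tuple of all captured groups
print(sentences)
```

Note that `groups()` returns a tuple of every captured group, while `group(1)` returns the first group as a string, which is what is wanted here.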

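I also notice that you insist every sentence start with a capital letter, yet you terminate it at the first semicolon. I'd say a "sentence" usually means the whole thing, including all clauses joined by semicolons. But if you do want to treat ";" as a delimiter, then I would count each clause as a sentence (because it is one, apart from the capital letter).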
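If every clause should count regardless of capitalization, one possible approach is to split on the delimiters instead of matching between them. This is a sketch of that idea, not something the answer itself proposes; it swaps in `re.split()` and uses a made-up sample string:

```python
import re

# Hypothetical sample text for illustration.
text = "Commas, they are included; but not the semicolon? Yes."

# Split on any terminator, then trim whitespace and drop empty pieces
# (the split leaves an empty string after the final period).
clauses = [part.strip() for part in re.split(r"[.!?;]", text) if part.strip()]
print(clauses)
```

Unlike the capturing-group pattern, this keeps clauses that begin with a lowercase letter.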

User

Answer 2 · edited 2024-06-02 05:48:18

Use a capturing group:

sentences = re.findall(r'([A-Z].*?)[.!?;]', stripped_value, re.MULTILINE | re.DOTALL | re.UNICODE)

.findall() returns the contents of the capturing groups instead of the whole match, if the expression contains any.

Demo:

>>> stripped_value = '''Some sentence. And another.
... Multiline text works too! And commas, they are included; but not the semicolon?
... '''
>>> import re
>>> re.findall(r'([A-Z].*?)[.!?;]', stripped_value, re.MULTILINE | re.DOTALL | re.UNICODE)
['Some sentence', 'And another', 'Multiline text works too', 'And commas, they are included']

From the re.findall() documentation:

If one or more groups are present in the pattern, return a list of groups; this will be a list of tuples if the pattern has more than one group.
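The quoted rule is easiest to see by running the same text through patterns with zero, one, and two groups. A small sketch, using a made-up two-sentence string:

```python
import re

# Hypothetical sample text for illustration.
text = "Ab. Cd!"

# No group: findall returns each whole match, punctuation included.
no_group = re.findall(r"[A-Z].*?[.!?;]", text)

# One group: findall returns only the captured text.
one_group = re.findall(r"([A-Z].*?)[.!?;]", text)

# Two groups: findall returns a list of tuples, one item per group.
two_groups = re.findall(r"([A-Z].*?)([.!?;])", text)

print(no_group)
print(one_group)
print(two_groups)
```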

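Alternatively, you can use a lookahead assertion:

sentences = re.findall(r"[A-Z].*?(?=[.!?;])", stripped_value, re.MULTILINE | re.DOTALL | re.UNICODE)

The (?=...) positive lookahead assertion acts as an anchor: the pattern matches only if the matched text is followed by punctuation, but the punctuation itself is not part of the match. The lookahead may give you faster results, since .findall() does not have to separate out the captured group. The output of the two approaches is otherwise identical.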
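To check that both patterns agree, here is a short sketch that runs them over the demo text from above. It only compares the output; the speed difference is not measured here. Only `re.DOTALL` is passed, on the assumption that the other two flags make no difference for this pattern (it has no `^` or `$`, and str patterns are Unicode-aware by default in Python 3):

```python
import re

text = """Some sentence. And another.
Multiline text works too! And commas, they are included; but not the semicolon?
"""

with_group = re.findall(r"([A-Z].*?)[.!?;]", text, re.DOTALL)
with_lookahead = re.findall(r"[A-Z].*?(?=[.!?;])", text, re.DOTALL)

# With the lookahead, the match stops just before the punctuation mark.
m = re.search(r"[A-Z].*?(?=[.!?;])", text, re.DOTALL)
next_char = text[m.end()]

print(with_group == with_lookahead)
print(repr(m.group(0)), repr(next_char))
```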
