Splitting text into sentences when special forms such as "{{" follow the period

Posted 2024-09-27 02:18:06


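I am parsing some information from Wikipedia, and the text in the dump includes special markup for links and images in the form {{content}} and [[content]]. I want to split the text into sentences, but the problem arises when a period is followed not by a space but by one of those symbols.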

So, in general, it has to split wherever '. ', '.{{', or '.[[' occurs.

Example:

prueba = 'Anarchism does not offer a fixed body of doctrine from a single particular worldview.{{sfn|Marshall|1993|pp=14–17}} Many types and traditions of anarchism exist, not all of which are mutually exclusive.[[Sylvan|2007|p=262]] [[Anarchist schools of thought]] can differ fundamentally, supporting anything from extreme [[individualism]] to complete [[collectivism]].'

import re

sentences = re.split(r'(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=\.|\?)\s', prueba)

For readability, here is the text again:

Anarchism does not offer a fixed body of doctrine from a single particular worldview.{{sfn|Marshall|1993|pp=14–17}} Many types and traditions of anarchism exist, not all of which are mutually exclusive.[[Sylvan|2007|p=262]] [[Anarchist schools of thought]] can differ fundamentally, supporting anything from extreme [[individualism]] to complete [[collectivism]].

The output of this code is a list with a single item containing the whole text:

['Anarchism does not offer a fixed body of doctrine from a single particular worldview.{{sfn|Marshall|1993|pp=14–17}} Many types and traditions of anarchism exist, not all of which are mutually exclusive.[[Sylvan|2007|p=262]] [[Anarchist schools of thought]] can differ fundamentally, supporting anything from extreme [[individualism]] to complete [[collectivism]].']

But I need a list with three items:

['Anarchism does not offer a fixed body of doctrine from a single particular worldview.', '{{sfn|Marshall|1993|pp=14–17}} Many types and traditions of anarchism exist, not all of which are mutually exclusive.', '[[Sylvan|2007|p=262]] [[Anarchist schools of thought]] can differ fundamentally, supporting anything from extreme [[individualism]] to complete [[collectivism]].']

How can I fix the regular expression? I have tried several different solutions, but none gave the expected result.

Thanks in advance.


1 answer

Answer 1 · posted 2024-09-27 02:18:06

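Since you appear to be trying to keep the delimiters, you probably want re.findall(). See this answer, https://stackoverflow.com/a/44244698/11199887, and adapt it to your case. With re.findall() you do not have to worry about the difference between '.{{' and '.[['.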

import re

s = """You! Are you Tom? I am Danny."""
re.findall(r'.*?[.!?]', s)
# ['You!', ' Are you Tom?', ' I am Danny.']

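In the example above you capture not only periods but also the question marks and exclamation marks that end sentences. Wikipedia probably does not have many sentences ending in an exclamation mark or question mark, but I did not really take the time to look for examples.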
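Note that the matches above keep their leading space, and that any trailing text without a terminator would be dropped by this pattern. A minimal sketch of one way to handle both, assuming you want whitespace stripped and the unterminated tail kept (the helper name `split_sentences` is hypothetical, not part of `re`):

```python
import re

def split_sentences(text):
    # Hypothetical helper. Each match is either a lazy run ending in
    # '.', '!' or '?', or a final run of characters that contains no
    # terminator at all (the unterminated tail of the text).
    pieces = re.findall(r'.*?[.!?]|[^.!?]+$', text, flags=re.DOTALL)
    # Strip surrounding whitespace and discard pieces that end up empty.
    return [p.strip() for p in pieces if p.strip()]

print(split_sentences("You! Are you Tom? I am Danny"))
```

The `re.DOTALL` flag lets `.` cross line breaks, which matters if a sentence in the dump wraps over several lines.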

For your case, it would look like this:

prueba = 'Anarchism does not offer a fixed body of doctrine from a single particular worldview.{{sfn|Marshall|1993|pp=14–17}} Many types and traditions of anarchism exist, not all of which are mutually exclusive.[[Sylvan|2007|p=262]] [[Anarchist schools of thought]] can differ fundamentally, supporting anything from extreme [[individualism]] to complete [[collectivism]].'

sentences = re.findall(r'.*?[.!?]', prueba)

Or, if you really only want to split on periods:

sentences = re.findall('.*?[.]', prueba)

The output of print(sentences) is:

['Anarchism does not offer a fixed body of doctrine from a single particular worldview.',
 '{{sfn|Marshall|1993|pp=14–17}} Many types and traditions of anarchism exist, not all of which are mutually exclusive.',
 '[[Sylvan|2007|p=262]] [[Anarchist schools of thought]] can differ fundamentally, supporting anything from extreme [[individualism]] to complete [[collectivism]].']
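If you would rather keep re.split(), as in the question, one possible approach is to split after a terminator either on whitespace or on the empty position just before '{{' or '[['. This is only a sketch: it assumes Python 3.7 or later (where re.split() can split on empty matches), uses a shortened sample text of my own rather than the full paragraph, and would also split on a period inside a template if that period is followed by whitespace (for example '{{sfn|p. 14}}').

```python
import re

# Shortened sample text (an assumption for illustration, not the original dump).
text = ('First claim.{{sfn|Marshall|1993}} Second claim.'
        '[[Sylvan|2007]] Third one. Fourth?')

# After '.', '?' or '!': consume whitespace if there is any; otherwise
# split on the empty position directly before '{{' or '[['.
pattern = r'(?<=[.?!])(?:\s+|(?=\{\{|\[\[))'

sentences = re.split(pattern, text)
print(sentences)
```

On this sample the result has four items, with each '{{...}}' or '[[...]]' attached to the start of the sentence that follows it, the same shape as the list requested in the question.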
