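I am parsing some information from Wikipedia. The text in the dump contains special annotations for links and images, of the form {{content}} or [[content]]. I want to split the text into sentences, but the problem arises when a period is followed not by a space but by one of those symbols.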
So, in general, it must split wherever '. ', '.{{' or '.[[' occurs.
Example:
import re
prueba = 'Anarchism does not offer a fixed body of doctrine from a single particular worldview.{{sfn|Marshall|1993|pp=14–17}} Many types and traditions of anarchism exist, not all of which are mutually exclusive.[[Sylvan|2007|p=262]] [[Anarchist schools of thought]] can differ fundamentally, supporting anything from extreme [[individualism]] to complete [[collectivism]].'
sentences = re.split(r'(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=\.|\?)\s', prueba)
For readability, here is the text again:
Anarchism does not offer a fixed body of doctrine from a single particular worldview.{{sfn|Marshall|1993|pp=14–17}} Many types and traditions of anarchism exist, not all of which are mutually exclusive.[[Sylvan|2007|p=262]] [[Anarchist schools of thought]] can differ fundamentally, supporting anything from extreme [[individualism]] to complete [[collectivism]].
The output of this code is a list with a single item containing the whole text:
['Anarchism does not offer a fixed body of doctrine from a single particular worldview.{{sfn|Marshall|1993|pp=14–17}} Many types and traditions of anarchism exist, not all of which are mutually exclusive.[[Sylvan|2007|p=262]] [[Anarchist schools of thought]] can differ fundamentally, supporting anything from extreme [[individualism]] to complete [[collectivism]].']
But I need a list with three items:
['Anarchism does not offer a fixed body of doctrine from a single particular worldview.', '{{sfn|Marshall|1993|pp=14–17}} Many types and traditions of anarchism exist, not all of which are mutually exclusive.', '[[Sylvan|2007|p=262]] [[Anarchist schools of thought]] can differ fundamentally, supporting anything from extreme [[individualism]] to complete [[collectivism]].']
How can I fix the regex? I have tried different solutions, but none gave the expected result.
Thanks in advance.
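Since you seem to be trying to keep the delimiters, you probably want re.findall(). See this answer, https://stackoverflow.com/a/44244698/11199887, and adapt it to your situation. With re.findall(), you don't have to worry about the differences between .{{, a period followed by a space, and .[[.
The example in the linked answer captures not only periods but also the question marks and exclamation marks that end sentences. Wikipedia probably doesn't have many sentences ending in an exclamation mark or a question mark, but I haven't really spent time looking for examples.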
For your case, it would look like this:
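A minimal sketch of that adaptation, assuming sentence-ending punctuation is limited to '.', '!' and '?'. The character class is inferred from the description above rather than quoted from the linked answer, and the sample text is made up for illustration:

```python
import re

# Hypothetical sample: wiki markup follows the terminator with no space.
text = ('First sentence.{{sfn|Marshall|1993}} Is this the second?'
        '[[Sylvan|2007]] Third one!')

# Lazily match everything up to and including the next '.', '!' or '?'.
# The terminator stays attached to its sentence.
sentences = re.findall(r'.*?[.!?]', text)
print(sentences)
```

Note that re.findall() only returns what the pattern matches, so any trailing text without a final terminator is dropped, and abbreviations such as "e.g." would be cut at each period.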
Or, if you really only want to split on periods:
sentences = re.findall('.*?[.]', prueba)
print(sentences)
This prints the three-item list requested in the question.
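If you would rather stay with re.split(), the rule from the question (split at '. ', '.{{' and '.[[') can be sketched with a lookbehind plus an alternation. This assumes Python 3.7 or later, where re.split() can split on empty matches; the helper name split_sentences and the sample string are made up for illustration:

```python
import re

def split_sentences(text):
    """Split after a period that is followed by whitespace, '{{' or '[['."""
    # (?<=\.)        the split point must come right after a period
    # \s+            consume the whitespace between two sentences, or
    # (?=\{\{|\[\[)  split on an empty match just before wiki markup
    pattern = r'(?<=\.)(?:\s+|(?=\{\{|\[\[))'
    return re.split(pattern, text)

sample = 'One.{{sfn|a}} Two. Three.[[b]] [[Four]] ends here.'
print(split_sentences(sample))
```

Unlike the re.findall() version, this keeps trailing text that has no final period, but it handles only periods and will still split after abbreviations.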