Split text on spaces preceded by a non-letter character

Posted 2024-10-05 12:24:44


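I couldn't find a solution anywhere online, so I'm asking here.

I want to split a given text at every punctuation mark: not only at the end of each sentence, but also after commas. So far I have tried the Natural Language Toolkit (nltk) and regular expressions, but I haven't gotten either to do what I want.

This approach comes close, but it doesn't fully meet my expectations: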

import nltk  # sent_tokenize needs NLTK's Punkt tokenizer data to be installed

sample_text = (
    "With this example I wanna make the point clear... I hope you get it! "
    "There are many coding languages out there, but which is the best? "
    "I would say there's no best. Change my mind - if you can!"
)
split_text = nltk.tokenize.sent_tokenize(sample_text)
print(split_text)

#Output: ['With this example I wanna make the point clear...', 'I hope you get it!', 'There are many coding languages out there, but which is the best?', "I would say there's no best.", 'Change my mind - if you can!']

That's already pretty good, but I would prefer an output that also splits the text at commas and hyphens. The output would then look like this:

[
 'With this example I wanna make the point clear...',
 'I hope you get it!',
 'There are many coding languages out there,',
 'but which is the best?',
 "I would say there's no best.",
 'Change my mind -',
 'if you can!'
]

A regular expression is probably the better tool here, isn't it? Somehow I couldn't get one to work, though. Thanks in advance for your help.


2 Answers

Regular expressions work fine for this. Try using this expression in re.split() (str.split() does not accept a pattern): [!"\#$%&'()*+,\-.\/:;<=>?@\[\\\]^_`{|}~]
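As a rough illustration of what splitting on such a character class does, here is a minimal sketch. It assumes that `string.punctuation` is an acceptable stand-in for the class quoted above, and the short sample sentence is chosen only for the demonstration. Note that the punctuation marks themselves are consumed by the split, and the apostrophe in "there's" also counts as a split point, so this alone does not produce the output the question asks for.

```python
import re
import string

sample = (
    "There are many coding languages out there, but which is the best? "
    "I would say there's no best."
)

# Assumption: build the character class from string.punctuation instead of
# typing it out by hand; re.escape makes every character safe inside [...].
punct_class = "[" + re.escape(string.punctuation) + "]"

# Splitting on the class removes the punctuation, so the pieces are
# stripped of surrounding spaces and empty leftovers are dropped.
parts = [p.strip() for p in re.split(punct_class, sample) if p.strip()]
print(parts)
# ['There are many coding languages out there', 'but which is the best',
#  'I would say there', 's no best']
```

Keeping the punctuation attached to each piece needs a pattern that matches only the whitespace after a punctuation mark, not the mark itself.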

You can split the string on every space that is not preceded by a letter:

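import re
# (?<=[^a-z]) is a lookbehind: the space must follow a non-letter (re.I covers A-Z too)
split_text = re.split(r'(?<=[^a-z]) ', sample_text, flags=re.I)
print(split_text)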

Output:

[
 'With this example I wanna make the point clear...',
 'I hope you get it!',
 'There are many coding languages out there,',
 'but which is the best?',
 "I would say there's no best.",
 'Change my mind -',
 'if you can!'
]
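If the text contains line breaks, repeated spaces, digits, or non-ASCII letters, the lookbehind idea may need small adjustments. The following is a sketch of one possible variation, not the answer's exact method: the helper name `split_on_punctuation` is hypothetical, whitespace is normalized first, and `[^\w\s]` is used in place of `[^a-z]` so that digits and accented letters do not trigger a split.

```python
import re


def split_on_punctuation(text):
    """Split text at whitespace that directly follows a punctuation mark."""
    # Collapse newlines and repeated spaces into single spaces so that a
    # wrapped line does not end up as a line break inside a chunk.
    normalized = " ".join(text.split())
    # Lookbehind: the preceding character is neither a word character
    # (letter, digit, underscore) nor whitespace, i.e. it is punctuation.
    return re.split(r"(?<=[^\w\s])\s+", normalized)


sample_text = """With this example I wanna make the point clear... I hope you get it! There are many coding
languages out there, but which is the best? I would say there's no best. Change my mind - if you can!"""

for chunk in split_on_punctuation(sample_text):
    print(chunk)
```

With the sample text from the question this yields the same seven chunks as the output above. The trade-off is that a space after a number such as "3" no longer counts as a split point, which differs from the `[^a-z]` pattern.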
