In most news articles, the first sentence starts with a location followed by a separator such as a colon or a dash, for example:
KUALA LUMPUR: North Korea and Malaysia on Monday locked horns over the investigation into the killing of leader Kim Jong-Un’s brother, as footage emerged of the moment he was fatally attacked in a Kuala Lumpur airport.
PORTLAND, Maine — FairPoint Communications has asked regulators for permission to stop signing up new customers for regulated landline service in Scarborough, Gorham, Waterville, Kennebunk and Cape Elizabeth.
I am trying to use re to split off the second part, which is the main sentence, for example:
North Korea and Malaysia on Monday locked horns over the investigation into the killing of leader Kim Jong-Un’s brother, as footage emerged of the moment he was fatally attacked in a Kuala Lumpur airport.
I use the following regex to separate them:
sep = re.split('-|:|--', sent)
But this does not work in every case. For the second sentence the result is:
['PORTLAND, Maine \xe2\x80\x94 FairPoint Communications has asked regulators for permission to stop signing up new customers for regulated landline service in Scarborough, Gorham, Waterville, Kennebunk and Cape Elizabeth.']
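Does this have something to do with unicode? Or do I need to pass a different form of the dash when building the regex?
Is there a more general way to do this?
Thanks.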
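As you guessed, the problem is the unicode character in the string: no ASCII character has the same value as the em dash, —.
The separator in
PORTLAND, Maine — FairPoint Communications
is therefore not matched by your pattern and shows up as \xe2\x80\x94 (its UTF-8 bytes) instead of —. There are a couple of options to get what you want:
Declare the source encoding by putting
# -*- coding: utf-8 -*-
on one of the first two lines of the file, and add the extra character to the regex.
Use a unicode regex (Python 2 syntax):
sep = re.split(ur'-|:|\u2014', sent)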
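In Python 3, where str is already unicode and the ur prefix no longer exists, the same idea might look like the sketch below. The sample string is a shortened version of the second sentence, and maxsplit=1 is an addition that is not part of the pattern above; it is there on the assumption that only the first separator should split the sentence.

```python
import re

# Shortened sample sentence; \u2014 is the em dash.
sent = "PORTLAND, Maine \u2014 FairPoint Communications has asked regulators"

# The em dash is a single code point but three bytes in UTF-8,
# which is where the \xe2\x80\x94 in the Python 2 output comes from.
assert "\u2014".encode("utf-8") == b"\xe2\x80\x94"

# Split on a colon or an em dash, swallowing surrounding whitespace.
# maxsplit=1 leaves any later colon or dash in the sentence untouched.
parts = re.split(r"\s*(?::|\u2014)\s*", sent, maxsplit=1)
print(parts)
```

The plain hyphen is left out of this pattern on purpose: splitting on every "-" would also cut names such as Kim Jong-Un in the first sentence.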
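Because the second sentence contains a unicode character, you need to define the source code encoding before running the code, since Python 2 assumes ASCII source by default. You are also trying to split the sentence on the wrong character. It must be
—
(the em dash, a unicode character), not the ASCII hyphen.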
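As for the more general approach the question asks about, one possibility is to look only for the first separator and accept it only when it sits near the start of the sentence. The following is a rough Python 3 sketch, not a definitive solution: the function name strip_dateline, the 40-character cutoff and the exact list of separators are all assumptions made for illustration.

```python
import re

# Separators assumed to follow a dateline: colon, em dash, en dash,
# double hyphen, or a single hyphen with whitespace on both sides.
DATELINE_SEP = re.compile(r"\s*(?::|\u2014|\u2013|--|\s-\s)\s*")


def strip_dateline(sentence, max_len=40):
    """Return the sentence without a leading dateline, if one is found.

    max_len is a hypothetical cutoff: a separator that starts later than
    this is assumed to belong to the sentence itself.
    """
    match = DATELINE_SEP.search(sentence)
    if match and match.start() <= max_len:
        return sentence[match.end():]
    return sentence


print(strip_dateline("KUALA LUMPUR: North Korea and Malaysia on Monday locked horns"))
print(strip_dateline("PORTLAND, Maine \u2014 FairPoint Communications has asked regulators"))
```

Because a bare hyphen only counts when it has whitespace around it, hyphenated names such as Kim Jong-Un are not split. A sentence with no dateline but with a colon in its first few words would still be cut, so the cutoff may need tuning for real data.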