Python: split the first sentence of a news article, then use …

Posted 2024-09-27 04:30:16


In most news articles, the first sentence starts with a location, followed by a separator such as a colon or a dash, for example:

KUALA LUMPUR: North Korea and Malaysia on Monday locked horns over the investigation into the killing of leader Kim Jong-Un’s brother, as footage emerged of the moment he was fatally attacked in a Kuala Lumpur airport.

PORTLAND, Maine — FairPoint Communications has asked regulators for permission to stop signing up new customers for regulated landline service in Scarborough, Gorham, Waterville, Kennebunk and Cape Elizabeth.

I am trying to use re to split off the second half, which is the main sentence, for example:

North Korea and Malaysia on Monday locked horns over the investigation into the killing of leader Kim Jong-Un’s brother, as footage emerged of the moment he was fatally attacked in a Kuala Lumpur airport.

I use the following regex to split them:

sep = re.split('-|:|--', sent)

But this does not work in every case. For the second sentence the result is:

['PORTLAND, Maine \xe2\x80\x94 FairPoint Communications has asked regulators for permission to stop signing up new customers for regulated landline service in Scarborough, Gorham, Waterville, Kennebunk and Cape Elizabeth.']

Does this have something to do with unicode? Or do I need to pass a different form of the dash to re?

Is there a more general way to do this?

Thanks.


2 Answers

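As you guessed, the problem is the unicode character in the string. No ASCII character has the same value as the em dash, so the separator in PORTLAND, Maine — FairPoint Communications is not one of the characters in your pattern. In your output it shows up as \xe2\x80\x94, which is the UTF-8 byte sequence for the em dash (U+2014).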
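The point above can be checked with a minimal sketch, assuming Python 3 (where every str is Unicode). The variable names are illustrative only. It shows that the em dash encodes to the three bytes seen in the question's output, and that the question's pattern has no alternative that matches it.

```python
import re

em_dash = "\u2014"  # the character between "Maine" and "FairPoint"

# UTF-8 encodes U+2014 as three bytes, which is what the question's output shows.
utf8_bytes = em_dash.encode("utf-8")
print(utf8_bytes)  # b'\xe2\x80\x94'

# The question's pattern only lists ASCII '-' and ':'. The '--' alternative
# can never win, because '-' is tried first at the same position.
sent = "PORTLAND, Maine \u2014 FairPoint Communications has asked regulators"
parts = re.split('-|:|--', sent)
print(len(parts))  # 1: nothing matched, so the sentence comes back whole
```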

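There are several ways to get what you want:

  • Declare the source file encoding as UTF-8 (put # -*- coding: utf-8 -*- on the first or second line of the file) and add the extra character to the regular expression.
  • Use one of the available libraries to convert the string to ASCII (see convert a unicode string …
  • Use a unicode regular expression with re, for example sep = re.split(ur'-|:| |\u2014', sent). The ur'' prefix is Python 2 syntax.
  • Or use the regex module, as suggested in the re documentation.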
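Two of the options above can be sketched as follows, assuming Python 3, where the ur'' prefix is neither needed nor valid. The separator set, the variable names and the DASHES_TO_ASCII table are assumptions made for illustration. The table is a hand-rolled stand-in for the library conversion mentioned in the second option, not that library's API.

```python
import re

sent = "PORTLAND, Maine \u2014 FairPoint Communications has asked regulators"

# Unicode regex: in Python 3 a plain raw string is enough, and the re
# module itself understands the \u2014 escape. maxsplit=1 stops after the
# first separator; the \s* parts swallow the surrounding spaces.
by_regex = re.split(r"\s*(?:--|[-:\u2014])\s*", sent, maxsplit=1)
print(by_regex)

# ASCII conversion, hand-rolled: map en and em dashes to an ASCII hyphen
# so that the pattern from the question starts matching.
DASHES_TO_ASCII = {0x2013: "-", 0x2014: "-"}
ascii_sent = sent.translate(DASHES_TO_ASCII)
by_translate = re.split('-|:|--', ascii_sent)
print(by_translate)
```

The first variant returns the location and the main sentence already stripped. The second keeps the spaces around the separator, so the parts still need .strip().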

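Because the second sentence contains a Unicode character, you need to define the source code encoding before running the code, since Python 2's default source encoding is ASCII. Also, you are trying to split the sentence with the wrong character. It has to be the em dash (U+2014), which is a Unicode character, not the ASCII hyphen.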


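#!/usr/bin/env python
# -*- coding: utf-8 -*-
import re

# The separator after the location is an em dash (U+2014), not an ASCII hyphen.
sent = "PORTLAND, Maine — FairPoint Communications has asked regulators for permission to stop signing up new customers for regulated landline service in Scarborough, Gorham, Waterville, Kennebunk and Cape Elizabeth."
# Split on a hyphen, a colon, or an em dash.
sep = re.split('-|:|—', sent)
print(sep)  # parenthesized so it runs on both Python 2 and Python 3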
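On the question of a more general method, one possible heuristic is sketched below, assuming Python 3. It is not taken from either answer. The name split_dateline, the 40-character limit and the separator set are all hypothetical choices. The idea is to accept a separator only near the start of the sentence and only when whitespace follows it, so that a hyphenated name such as Kim Jong-Un is never treated as a split point.

```python
import re

# Hypothetical heuristic: the location is a short leading run that starts
# with an uppercase letter and ends at the first colon, hyphen, en dash,
# em dash or "--" that is followed by whitespace.
DATELINE = re.compile(
    r"^(?P<place>[A-Z][^:\u2013\u2014-]{0,40}?)"  # location, at most 41 chars
    r"\s*(?:--|[:\u2013\u2014-])\s+"              # separator plus whitespace
    r"(?P<body>.+)$",                             # the main sentence
    re.DOTALL,
)


def split_dateline(sent):
    """Return (location, body); location is None when none is found."""
    match = DATELINE.match(sent)
    if match is None:
        return None, sent
    return match.group("place").strip(), match.group("body")


print(split_dateline("KUALA LUMPUR: North Korea and Malaysia locked horns."))
print(split_dateline("PORTLAND, Maine \u2014 FairPoint has asked regulators."))
print(split_dateline("Kim Jong-Un\u2019s brother was attacked."))
```

A sentence without a leading location comes back unchanged with None in the first slot, so the caller can tell the two cases apart.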
