Python regex tokenization of Unicode strings not working as expected - Q&A - Python中文网

Posted 2024-09-24 02:15:18

I ran into a strange problem with regex tokenization and Unicode strings.

> import re
> mystring = "Unicode rägular expressions"
> tokens = re.findall(r'\w+', mystring, re.UNICODE)

This is what I get:

[output block not preserved in the source page]

This is what I expected:

> print tokens
['Unicode', 'rägular', 'expressions']

What do I need to do to get the expected result?

Update: this question is different from mine: "matching unicode characters in python regular expressions", but its answer https://stackoverflow.com/a/5028826/1251687 also solved my problem.

2 answers

User

Answer 1 · edited 2024-09-24 02:15:18

The string must be a unicode object (note the u prefix), not a byte string.

mystring = u"Unicode rägular expressions"
tokens = re.findall(r'\w+', mystring, re.UNICODE)
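For comparison, here is a minimal sketch of the same call on Python 3. That is an assumption on my part, since the question uses Python 2 (`print tokens`). On Python 3 a plain string literal is already Unicode text, so no `u` prefix is needed, and `\w` is Unicode-aware for `str` patterns by default.

```python
import re

# Python 3 (assumed here): a plain literal is already Unicode text (type str).
mystring = "Unicode rägular expressions"

# For str patterns, \w is Unicode-aware by default;
# passing re.UNICODE is redundant but harmless.
tokens = re.findall(r"\w+", mystring, re.UNICODE)

print(tokens)  # ['Unicode', 'rägular', 'expressions']
```

On Python 2, the `u"..."` literal from the answer plays the same role as the plain literal does here.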

User

Answer 2 · edited 2024-09-24 02:15:18

You have Latin-1 or Windows codepage 1252 bytes, not Unicode text. Decode the input:

tokens = re.findall(r'\w+', mystring.decode('cp1252'), re.UNICODE)

Encoded bytes can mean anything depending on the codec used; they are not specific Unicode code points. For byte strings (type str), \w can only match ASCII characters.
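The difference between matching on bytes and matching on decoded text can be sketched as follows. This is a Python 3 illustration under stated assumptions, not the asker's exact setup: the name `raw` and the choice of `cp1252` as the input encoding are hypothetical and simply follow the answer's guess.

```python
import re

# Hypothetical input: the sentence as cp1252-encoded bytes (ä becomes byte 0xE4).
raw = "Unicode rägular expressions".encode("cp1252")

# On a bytes pattern, \w matches only ASCII word characters,
# so the 0xE4 byte acts as a separator and splits the token.
byte_tokens = re.findall(rb"\w+", raw)

# Decoding first yields Unicode text, where ä counts as a word character.
text_tokens = re.findall(r"\w+", raw.decode("cp1252"))

print(byte_tokens)  # [b'Unicode', b'r', b'gular', b'expressions']
print(text_tokens)  # ['Unicode', 'rägular', 'expressions']
```

If the bytes were actually in a different encoding (for example UTF-8), the `decode` call would need that codec instead. Using the wrong codec either raises an error or produces the wrong characters.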
