Python re and regex: search() does not match an identical string containing non-ASCII characters

# python
Python 2.7.3 (default, Apr 14 2012, 08:58:41)
[GCC] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import re
>>> import regex
>>> s1 = 'wow'
>>> s2 = 'ℛℯα∂α♭ℓℯ ♭ʊ☂ η☺т Ѧ$☾ℐℐ'
>>> print(s2)
ℛℯα∂α♭ℓℯ ♭ʊ☂ η☺т Ѧ$☾ℐℐ
>>> re.search(s1,s1)
<_sre.SRE_Match object at 0x7f0ce27c38b8>
>>> re.search(s2,s2)
>>> type(s2)
<type 'str'>
>>> us2 = unicode(s2,'utf-8')
>>> us2
u'\u211b\u212f\u03b1\u2202\u03b1\u266d\u2113\u212f \u266d\u028a\u2602 \u03b7\u263a\u0442 \u0466$\u263e\u2110\u2110'
>>> re.search(us2,us2,re.UNICODE)
>>> regex.search(s2,s2)
>>> regex.search(us2,us2,regex.UNICODE)
>>>

1 answer

User

#1 · Posted 2024-10-01 00:18:34

Note that, when s2 is compiled as a regex pattern, it contains an at at_end node, i.e. an end-of-string anchor:

In [62]: re.compile(s2, re.DEBUG)
literal 226
literal 132
literal 155
...
at at_end
...
literal 226
literal 132
literal 144

This is because, as a UTF-8 encoded byte string, s2 is

In [61]: s2 = 'ℛℯα∂α♭ℓℯ ♭ʊ☂ η☺т Ѧ$☾ℐℐ'
In [72]: s2
Out[72]: '\xe2\x84\x9b\xe2\x84\xaf\xce\xb1\xe2\x88\x82\xce\xb1\xe2\x99\xad\xe2\x84\x93\xe2\x84\xaf \xe2\x99\xad\xca\x8a\xe2\x98\x82 \xce\xb7\xe2\x98\xba\xd1\x82 \xd1\xa6$\xe2\x98\xbe\xe2\x84\x90\xe2\x84\x90'
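
To connect the byte string above with the `literal` numbers in the re.DEBUG dump, here is a minimal sketch. It assumes Python 3, where `str` is Unicode and the UTF-8 bytes have to be requested explicitly with `encode` (the session in the question is Python 2, where a plain `str` literal already holds those bytes). The variable names are illustrative only.

```python
s2 = 'ℛℯα∂α♭ℓℯ ♭ʊ☂ η☺т Ѧ$☾ℐℐ'

# In Python 3 the UTF-8 bytes must be produced explicitly.
encoded = s2.encode('utf-8')

# The first character, U+211B, becomes three bytes: 0xE2 0x84 0x9B.
# These are the "literal 226 / 132 / 155" lines at the top of the dump.
first_char_bytes = list('ℛ'.encode('utf-8'))

# The last character, U+2110, becomes 0xE2 0x84 0x90,
# matching "literal 226 / 132 / 144" at the bottom of the dump.
last_char_bytes = list(encoded[-3:])

# "$" is plain ASCII (byte 36), so it survives encoding unchanged
# and the regex compiler still sees it as a metacharacter.
dollar_byte = ord('$')
dollar_in_bytes = b'$' in encoded
```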

Note that s2 contains a $:

In [75]: '$' in s2
Out[75]: True
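
Why a `$` in the middle of the pattern rules out any match can be shown on a shorter string. This is a sketch that assumes Python 3 and uses a three-character excerpt of s2; the names `text`, `unescaped` and `escaped` are made up for the example.

```python
import re

text = 'Ѧ$☾'

# Used as a pattern, the "$" is an end-of-string anchor, so the regex
# asks for "Ѧ", then the end of the string, then "☾". No character can
# follow the end of the string, so nothing matches.
unescaped = re.search(text, text)

# With the "$" escaped by hand it is an ordinary dollar sign.
escaped = re.search(r'Ѧ\$☾', text)
```

Here `unescaped` is `None`, while `escaped` is a match object covering the whole string.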

To keep $ from being interpreted as the at at_end anchor, use re.escape to escape all non-alphanumeric characters in the pattern:

In [67]: pat = re.compile(re.escape(s2))

In [68]: pat.search(s2)
Out[68]: <_sre.SRE_Match at 0x7feb6b44dd98>

The same escaping is needed for the unicode pattern:

In [78]: us2 = unicode(s2,'utf-8')

In [79]: re.search(re.escape(us2), us2)
Out[79]: <_sre.SRE_Match at 0x7feb6b44ded0>

since the unicode string contains a $ as well:

In [81]: u'$' in us2
Out[81]: True
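
For readers on Python 3, the same fix might look like the sketch below. It assumes Python 3, where string literals are already Unicode and the `unicode()` call from the transcripts is no longer needed. `find_literal` is a hypothetical helper name, not part of `re`.

```python
import re

s2 = 'ℛℯα∂α♭ℓℯ ♭ʊ☂ η☺т Ѧ$☾ℐℐ'


def find_literal(needle, haystack):
    """Search for needle as literal text by escaping regex metacharacters."""
    return re.search(re.escape(needle), haystack)


# Unescaped: the "$" acts as an anchor, so there is no match.
raw_match = re.search(s2, s2)

# Escaped: the "$" is matched literally and the whole string is found.
literal_match = find_literal(s2, s2)

# When no regex features are needed at all, a plain substring test
# sidesteps escaping entirely (a different technique from re.escape).
is_substring = s2 in s2
```

If the text to find is always a fixed string, the substring test is arguably the simpler choice; `re.escape` earns its place when the literal text is only one part of a larger pattern.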
