Regex returns a match for incorrect Japanese characters

>>> import re
>>> regstr = ".*([0-9０-９]{3}[-ー]{1}[0-9０-９]{4}).*"
>>> re.match( regstr, "this is obviously not going to work")
>>> re.match( regstr, "this is going to work 123-4567")
<_sre.SRE_Match object at 0x7fced8b485d0>
>>> re.match( regstr, "this is going to work too １２３ー４５６７")
<_sre.SRE_Match object at 0x7fced8b48648>
>>> re.match( regstr, "This will not work, as it should not : 1234-567")
>>> re.match( regstr, "This should not work, but it does : １２３４ー５６７")
<_sre.SRE_Match object at 0x7fced8b48648>
>>> re.match( regstr, "Now just seems crazy ....... 京都府")
<_sre.SRE_Match object at 0x7fced8b485d0>
>>> re.match( regstr, "京都府")
<_sre.SRE_Match object at 0x7fced8b48648>
>>> "京都府"
'\xe4\xba\xac\xe9\x83\xbd\xe5\xba\x9c'
>>> re.match( regstr, "\xe4\xba\xac\xe9\x83\xbd\xe5\xba\x9c")
<_sre.SRE_Match object at 0x7fced8b48648>

# 'Japanese numbers' code
>>> "０１２３４５６７８９ー"
'\xef\xbc\x90\xef\xbc\x91\xef\xbc\x92\xef\xbc\x93\xef\xbc\x94\xef\xbc\x95\xef\xbc\x96\xef\xbc\x97\xef\xbc\x98\xef\xbc\x99\xe3\x83\xbc'

2 Answers

User

Answer 1 · edited 2024-10-04 11:31:59

I just tried your tests on Python 3.6.6 and the results were as expected. The only thing I did differently was to use re.compile. Look:

Python 3.6.6 (default, Jul 19 2018, 14:25:17) 
[GCC 8.1.1 20180712 (Red Hat 8.1.1-5)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import re
>>> zipcode = re.compile(r'.*([0-9０-９]{3}[-ー]{1}[0-9０-９]{4}).*')
>>> zipcode.match("this is obviously not going to work")
>>> zipcode.match("this is going to work 123-4567")
<_sre.SRE_Match object; span=(0, 30), match='this is going to work 123-4567'>
>>> zipcode.match("this is going to work 123-4567").group(0)
'this is going to work 123-4567'
>>> zipcode.match("this is going to work 123-4567").group(1)
'123-4567'
>>> zipcode.match("this is going to work too １２３ー４５６７").group(1)
'１２３ー４５６７'
>>> zipcode.match("This should not work, but it does :  １２３４ー５６７").group(1)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'NoneType' object has no attribute 'group'
>>> zipcode.match("This should not work, but it does :  １２３４ー５６７")
>>> zipcode.match("Now just seems crazy ....... 京都府")
>>> zipcode.match("京都府")
>>>

Edit

Here is what I have found so far:

$ cat ziptest.py 
# -*- coding: utf-8 -*-
import re
zipcode = re.compile(r'.*([0-9０１２３４５６７８９]{3}[-ー]{1}[0-9０１２３４５６７８９]{4}).*')
tests = (
    "this is obviously not going to work",
    "this is going to work 123-4567",
    "this is going to work too １２３ー４５６７",
    "This will not work, as it should not :  1234-567",
    "This should not work, but it does :  １２３４ー５６７",
    "Now just seems crazy ....... 京都府",
    "京都府",
    "\xe4\xba\xac\xe9\x83\xbd\xe5\xba\x9c"
)

for test in tests:
    print('%s: %s' % (test, "Match" if zipcode.match(test) else "No match"))
$

The results:

$ python2.7 ziptest.py 
this is obviously not going to work: No match
this is going to work 123-4567: Match
this is going to work too １２３ー４５６７: Match
This will not work, as it should not :  1234-567: No match
This should not work, but it does :  １２３４ー５６７: Match
Now just seems crazy ....... 京都府: No match
京都府: No match
京都府: No match

$ python3.6 ziptest.py 
this is obviously not going to work: No match
this is going to work 123-4567: Match
this is going to work too １２３ー４５６７: Match
This will not work, as it should not :  1234-567: No match
This should not work, but it does :  １２３４ー５６７: No match
Now just seems crazy ....... 京都府: No match
京都府: No match
äº¬é½åº: No match

Hope this helps.

User

Answer 2 · edited 2024-10-04 11:31:59

regstr = ".*([0-9０-９]{3}[-ー]{1}[0-9０-９]{4}).*"

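In Python 3, regstr is a Unicode string that contains some non-ASCII characters. In Python 2, it is a byte string in some encoding, which depends on what you declared at the top of the module (see PEP 263) and on the encoding actually used to save the file. To avoid problems like this, I suggest never putting literal non-ASCII characters in a regex, because they are too hard to debug. Escape them instead.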
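To make the difference concrete, here is a minimal Python 3 sketch that compiles the question's pattern twice: once as text and once as UTF-8 bytes. The byte version only emulates a Python 2 byte-string pattern, and it assumes the asker's source file was saved as UTF-8 (which the byte dumps in the question suggest). The names text_pattern, byte_pattern, and samples are made up for this illustration.

```python
import re

# The pattern from the question, as text.
PATTERN_TEXT = ".*([0-9０-９]{3}[-ー]{1}[0-9０-９]{4}).*"

# Python 3 default: pattern and input are both Unicode text.
text_pattern = re.compile(PATTERN_TEXT)

# Emulates a Python 2 byte-string pattern saved as UTF-8 (an assumption
# about the asker's setup). Each byte is now a separate "character", so
# the class [0-9０-９] turns into a set of single bytes and byte ranges.
byte_pattern = re.compile(PATTERN_TEXT.encode("utf-8"))

samples = ["123-4567", "１２３ー４５６７", "１２３４ー５６７", "京都府"]
for sample in samples:
    as_text = bool(text_pattern.match(sample))
    as_bytes = bool(byte_pattern.match(sample.encode("utf-8")))
    print(f"{sample}: text={as_text} bytes={as_bytes}")
```

In byte mode the range ０-９ is read as a range between single bytes of the two multi-byte characters, so it accepts many bytes that are not digits. That is why 京都府 and １２３４ー５６７ match as bytes but not as text.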

The characters ０１２３４５６７８９ are the Unicode code points '\uff10' through '\uff19', so I suggest writing them that way.
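If you want to double-check those escapes, a short Python 3 sketch along these lines will do it. The variable names are only illustrative, and '\u30fc' is the dash-like mark ー from the question's pattern.

```python
import unicodedata

# Build the full-width digits from their code points, U+FF10 through U+FF19.
fullwidth_digits = "".join(chr(cp) for cp in range(0xFF10, 0xFF1A))

# The dash-like character used as the separator in the question.
long_mark = "\u30fc"

print(fullwidth_digits)
print(unicodedata.name(fullwidth_digits[0]))
print(unicodedata.name(long_mark))
```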

Also, if you are using a Unicode regular expression, you should define it as a Unicode string with the u prefix:

regstr = u".*([0-9\uff10-\uff19]{3}[-\u30fc]{1}[0-9\uff10-\uff19]{4}).*"

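Later, when you match this regular expression against a string, that string should also be a Unicode string rather than a plain str. To do that, you need to know how the input is encoded. For example, if the input is UTF-8, use: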

input_string_as_unicode = unicode(input_string_as_utf8, 'utf-8')
re.match(regstr, input_string_as_unicode)

Note that your input may already be Unicode if a framework is doing the decoding for you. If you are not sure, check type(input_string).
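The snippets above are written for Python 2. For comparison, a rough Python 3 equivalent might look like the sketch below. Here str is already Unicode, so only byte input needs decoding. The helper name find_zipcode is hypothetical, and the sketch assumes any byte input is UTF-8.

```python
import re

# Same pattern as above, written with escapes only so that it does not
# depend on the encoding of the source file.
ZIPCODE = re.compile(
    ".*([0-9\uff10-\uff19]{3}[-\u30fc]{1}[0-9\uff10-\uff19]{4}).*"
)


def find_zipcode(raw):
    """Return the matched postal code, or None if there is no match.

    `raw` may be str or bytes (hypothetical helper for illustration).
    """
    # Assumption: byte input is UTF-8. Decode it so that the pattern and
    # the input are both Unicode text.
    text = raw.decode("utf-8") if isinstance(raw, bytes) else raw
    match = ZIPCODE.match(text)
    return match.group(1) if match else None


print(find_zipcode("this is going to work 123-4567"))
print(find_zipcode("this is going to work too １２３ー４５６７".encode("utf-8")))
print(find_zipcode("京都府"))
```

Decoding at the boundary keeps the regex itself free of any encoding concerns, which is the point of the advice above.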
