Python re.findall finds strange, wrong patterns

Posted 2024-09-30 14:31:01


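I'm curious why re.findall does such weird things as returning empty strings and tuples (what are those supposed to mean?). It doesn't seem to treat parentheses () the way I expect, and it also seems to interpret | wrongly: ab|cd is read as (ab)|(cd), not as a(b|c)d, which is what I would normally expect. Because of that, I can't define the regex I need.
But in this example I see clearly wrong behavior on a simple pattern: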

([a-zA-Z0-9]+\.+)+[a-zA-Z0-9]{1,3}

This describes simple URLs such as gskinner.com and www.capitolconnection.org, as can be seen in the regex help at https://regexr.com/. With re.findall, however, I get:

hotmail.
living.
item.
2.
4S.

That is, only fragments ending in a dot rather than the whole addresses. How can that be?

The full code, in which I try to filter junk out of the text, is:

import re

singles = r'[()\.\/$%=0-9,?!=; \t\n\r\f\v\":\[\]><]'


digits_str = singles + r'[()\-\.\/$%=0-9 \t\n\r\f\v\'\":\[\]]*'



#small_word = '[a-zA-Z0-9]{1,3}'

#junk_then_small_word = singles + small_word + '(' + singles + small_word + ')*'


email = singles + r'\S+@\S*'






http_str = r'[^\.]+\.+[^\.]+\.+([^\.]+\.+)+?'

http = '(http|https|www)' + http_str

web_address = r'([a-zA-Z0-9]+\.+)+[a-zA-Z0-9]{1,3}'


pat = email + '|' + digits_str

d_pat = re.compile(web_address)

text =  '''"Lucy Gonzalez" test-defis-wtf <stagecoachmama@hotmail.com> on 11/28/2000 01:02:22 PM
http://www.living.com/shopping/item/item.jhtml?.productId=LC-JJHY-2.00-10.4S.I will send checks
 directly to the vendor for any bills pre 4/20.  I will fax you copies.  I will also try and get the payphone transferred.

www.capitolconnection.org <http://www.capitolconnection.org>.

and/or =3D=3D=3D=3D=3D=3D=3D= O\'rourke'''


print('findall:')

for x in re.findall(d_pat,text):
    print(x)


print('split:')
for x in re.split(d_pat,text):
    print(x)

2 Answers

From the documentation for re.findall:

If one or more groups are present in the pattern, return a list of groups; this will be a list of tuples if the pattern has more than one group.

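Your regex has groups, namely the parts in parentheses. If you want to show the whole match, put the regex in one big group (parentheses around the whole thing) and then do print(x[0]) instead of print(x).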
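As a minimal sketch of the behavior described above, the snippet below compares the question's pattern with a variant that uses a non-capturing group (?:...), and with re.finditer plus group(0). The sample text is a shortened, made-up sentence rather than the original input, and the variable names are only illustrative.

```python
import re

# Hypothetical sample text, shortened from the kind of input in the question.
text = "mail me at stagecoachmama@hotmail.com or visit www.capitolconnection.org today"

# Capturing group: findall returns the group, and a repeated group
# keeps only its last repetition.
capturing = r'([a-zA-Z0-9]+\.+)+[a-zA-Z0-9]{1,3}'
# Non-capturing group: findall returns the whole match.
non_capturing = r'(?:[a-zA-Z0-9]+\.+)+[a-zA-Z0-9]{1,3}'

group_results = re.findall(capturing, text)
full_results = re.findall(non_capturing, text)
# finditer yields match objects; group(0) is always the whole match.
iter_results = [m.group(0) for m in re.finditer(capturing, text)]

print(group_results)  # ['hotmail.', 'capitolconnection.']
print(full_results)   # ['hotmail.com', 'www.capitolconnection.org']
print(iter_results)   # ['hotmail.com', 'www.capitolconnection.org']
```

Switching to (?:...) avoids both the extra outer group and the x[0] indexing, which may be the simpler option when the group contents are not needed.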

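I'm guessing the expression has to be modified here, which may be where the problem lies. For example, to match the desired pattern, we could start with an expression like this: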

([a-zA-Z0-9]+)\.

If we want to allow 1 to 3 characters after the ., we can extend it to:

([a-zA-Z0-9]+)\.([a-zA-Z0-9]{1,3})?


Test

# coding=utf8
# the above tag defines encoding for this document and is for Python 2.x compatibility

import re

regex = r"([a-zA-Z0-9]+)\.([a-zA-Z0-9]{1,3})?"

test_str = ("hotmail.\n"
    "living.\n"
    "item.\n"
    "2.\n"
    "4S.\n"
    "hotmail.com\n"
    "living.org\n"
    "item.co\n"
    "2.321\n"
    "4S.123")

matches = re.finditer(regex, test_str, re.MULTILINE)

for matchNum, match in enumerate(matches, start=1):
    print("Match {matchNum} was found at {start}-{end}: {match}".format(
        matchNum=matchNum, start=match.start(), end=match.end(), match=match.group()))

    # Groups are numbered from 1; group 2 is None when the optional part is absent.
    for groupNum in range(1, len(match.groups()) + 1):
        print("Group {groupNum} found at {start}-{end}: {group}".format(
            groupNum=groupNum, start=match.start(groupNum), end=match.end(groupNum), group=match.group(groupNum)))

# Note: for Python 2.7 compatibility, use ur"" to prefix the regex and u"" to prefix the test string and substitution.
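
The question also raises the precedence of | and calls re.split with the same grouped pattern. A small sketch, using made-up sample strings that are not from the original post, suggests what to expect: | has the lowest precedence, so it has to be wrapped in a group to limit its scope, and re.split inserts the text of capturing groups into its result.

```python
import re

sample = 'ab cd abd acd'  # hypothetical sample text

# 'ab|cd' means "ab" or "cd"; the alternation spans the whole pattern.
either = re.findall(r'ab|cd', sample)
# A non-capturing group restricts the alternation to the middle character.
grouped = re.findall(r'a(?:b|c)d', sample)

line = 'see www.capitolconnection.org now'  # hypothetical sample text
# With a capturing group, re.split also returns the captured text.
split_capturing = re.split(r'([a-zA-Z0-9]+\.+)+[a-zA-Z0-9]{1,3}', line)
# With a non-capturing group, only the pieces between matches are returned.
split_plain = re.split(r'(?:[a-zA-Z0-9]+\.+)+[a-zA-Z0-9]{1,3}', line)

print(either)           # ['ab', 'cd', 'ab', 'cd']
print(grouped)          # ['abd', 'acd']
print(split_capturing)  # ['see ', 'capitolconnection.', ' now']
print(split_plain)      # ['see ', ' now']
```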
