Why is this regex greedy, and why does the example code repeat forever?

Posted 2024-10-03 19:23:28


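I'm going mad trying to figure this out. It has been three days and I'm about ready to give up. The code below should return a list of all phone numbers and email addresses on the clipboard, without repeating.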

#! python 3
#! Phone number and email address scraper

#take user input for:
#1. webpage to scrape
# - user will be prompted to copy a link
#2. file & location to save to
#3. back to 1 or exit

import pyperclip, re, os.path

#function for locating phone numbers
def phoneNums(clipboard):
    phoneNums = re.compile(r'^(?:\d{8}(?:\d{2}(?:\d{2})?)?|\(\+?\d{2,3}\)\s?(?:\d{4}[\s*.-]?\d{4}|\d{3}[\s*.-]?\d{3}|\d{2}([\s*.-]?)\d{2}\1\d{2}(?:\1\d{2})?))$')
        #(\+\d{1,4})?                   #Optional country code (optional: +, 1-4 digits)
        #(\s)?                          #Optional space
        #(\(\d\))?                      #Optional bracketed area code
        #(\d\d(\s)?\d | \d{3})          #3 digits with optional space between
        #(\s)?                          #Optional space
        #(\d{3})                        #3 digits
        #(\s)?                          #Optional space
        #(\d{4})                        #Last four
        #)
        #)', re.VERBOSE)
    #nos = phoneNums.search(clipboard)  #ignore for now. Failed test of .group()

    return phoneNums.findall(clipboard)

#function for locating email addresses
def emails(clipboard):
    emails = re.compile(r'''(
        [a-z0-9._%+-]*     #username
        @                  #@ sign
        [a-z0-9.-]+        #domain name
        )''', re.I | re.VERBOSE)
    return emails.findall(clipboard)


#function for copying email addresses and numbers from webpage to a file
def scrape(fileName, saveLoc):
    newFile = os.path.join(saveLoc, fileName + ".txt")
    #file = open(newFile, "w+")
    #add phoneNums(currentText) +
    print(currentText)
    print(emails(currentText))
    print(phoneNums(currentText))
    #file.write(emails(currentText))
    #file.close()

url = ''
currentText = ''
file = ''
location =  ''

while True:
    print("Please paste text to scrape. Press ENTER to exit.")
    currentText = str(pyperclip.waitForNewPaste())
    #print("Filename?")
    #file = str(input())
    #print("Where shall I save this? Defaults to C:")
    #location = str(input())
    scrape(file, location)

The emails come back correctly, but the phone-number pattern that is now commented out produces this output:

[('+30 210 458 6600', '+30', ' ', '', '210', '', ' ', '458', ' ', '6600'), ('+30 210 458 6601', '+30', ' ', '', '210', '', ' ', '458', ' ', '6601')]

As you can see, the numbers are identified correctly, but my code is too greedy, so I tried adding "+?":

def phoneNums(clipboard):
    phoneNums = re.compile(r'''(
        (\+\d{1,4})?                   #Optional country code (optional: +, 1-4 digits)
        (\s)?                          #Optional space
        (\(\d\))?                      #Optional bracketed area code
        (\d\d(\s)?\d | \d{3})          #3 digits with optional space between
        (\s)?                          #Optional space
        (\d{3})                        #3 digits
        (\s)?                          #Optional space
        (\d{4})                        #Last four
        )+?''', re.VERBOSE)

No joy. I then tried dropping in a regex example from here: Find phone numbers in python script

Now, I know that one works, because other people have tested it. What I get is:

Please paste text to scrape. Press ENTER to exit. 
[] [] 
Please paste text to scrape. Press ENTER to exit. 
[] [('', '', '', '', '', '', '','', '', '')] 
...forever...

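That last one doesn't even let me copy anything to the clipboard. waitForNewPaste() is supposed to do what it says on the tin, but the moment I run the code, the program grabs whatever is already on the clipboard and tries to process it (badly).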

Clearly there is a quirk somewhere in my code, but I can't see it. Any ideas?


1 answer
Answer 1 · posted 2024-10-03 19:23:28

As you pointed out, the regex does work: it finds the numbers.

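The input fragment '+30 210 458 6600' matches once, and findall() reports that match as a tuple holding every capturing group: ('+30 210 458 6600', '+30', ' ', '', '210', '', ' ', '458', ' ', '6600'). Groups that did not take part in the match appear as empty strings.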

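Note that the first element of the tuple is the whole match, because the outermost pair of parentheses wraps the entire pattern.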

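If you make every group non-capturing by inserting ?: right after its opening parenthesis (and drop the outermost pair), no capturing groups remain, and findall() returns only the full match as a str: '+30 210 458 6600'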

    phoneNums = re.compile(r'''
        (?:\+\d{1,4})?                   #Optional country code (optional: +, 1-4 digits)
        (?:\s)?                          #Optional space
        (?:\(\d\))?                      #Optional bracketed area code
        (?:\d\d(?:\s)?\d | \d{3})        #3 digits with optional space between
        (?:\s)?                          #Optional space
        (?:\d{3})                        #3 digits
        (?:\s)?                          #Optional space
        (?:\d{4})                        #Last four
        ''', re.VERBOSE)
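
To see the difference side by side, here is a minimal sketch that runs both versions of the pattern over the same text. The sample string and the variable names (sample, capturing, non_capturing) are made up for illustration; the patterns themselves are the ones from the question and from this answer.

```python
import re

# Hypothetical sample text standing in for the clipboard contents.
sample = "Call +30 210 458 6600 or +30 210 458 6601"

# Capturing version from the question: every (...) adds one slot
# to each tuple that findall() returns.
capturing = re.compile(r'''(
    (\+\d{1,4})?            # optional country code
    (\s)?                   # optional space
    (\(\d\))?               # optional bracketed area code
    (\d\d(\s)?\d | \d{3})   # 3 digits with optional space between
    (\s)?                   # optional space
    (\d{3})                 # 3 digits
    (\s)?                   # optional space
    (\d{4})                 # last four
    )''', re.VERBOSE)

# Non-capturing version: with no capturing groups left,
# findall() returns each full match as a plain string.
non_capturing = re.compile(r'''
    (?:\+\d{1,4})?              # optional country code
    (?:\s)?                     # optional space
    (?:\(\d\))?                 # optional bracketed area code
    (?:\d\d(?:\s)?\d | \d{3})   # 3 digits with optional space between
    (?:\s)?                     # optional space
    (?:\d{3})                   # 3 digits
    (?:\s)?                     # optional space
    (?:\d{4})                   # last four
    ''', re.VERBOSE)

tuples = capturing.findall(sample)
strings = non_capturing.findall(sample)

print(tuples[0])
# ('+30 210 458 6600', '+30', ' ', '', '210', '', ' ', '458', ' ', '6600')
print(strings)
# ['+30 210 458 6600', '+30 210 458 6601']
```

Both patterns match exactly the same text; only the shape of what findall() hands back changes. So the long tuples are a side effect of the capturing groups, and adding "+?" does nothing to remove them.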

The code "repeats forever" because the while True: block is an infinite loop. If you want it to stop after one iteration, put a break statement at the end of the block:

while True:
    currentText = str(pyperclip.waitForNewPaste())
    scrape(file, location)
    break
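
If you would rather keep processing pastes until you decide to stop, one option is to break on a sentinel value instead of unconditionally. The sketch below assumes a blank paste means "exit"; that rule, the function name scrape_until_blank, and the fake paste source are all hypothetical and not part of your script. In the real program you would presumably pass pyperclip.waitForNewPaste as next_text.

```python
def scrape_until_blank(next_text, handle):
    """Feed each text from next_text() to handle() until a blank text arrives."""
    processed = 0
    while True:
        text = str(next_text())
        if not text.strip():      # assumed exit rule: a blank paste ends the loop
            break
        handle(text)
        processed += 1
    return processed

# Stand-in for the clipboard so the sketch runs without pyperclip.
fake_pastes = iter(["+30 210 458 6600", "info@example.com", ""])
seen = []
count = scrape_until_blank(lambda: next(fake_pastes), seen.append)
print(count, seen)
```

Passing the text source in as a function keeps the exit logic separate from the clipboard, which also makes the loop easy to try out with canned input.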
