I can't seem to handle empty results from regex (re.search) in Python: I get either duplicate results or no result at all

import re

import requests
from bs4 import BeautifulSoup

weblinks = []
email = []

page = requests.get('https://www.ourcommons.ca/Parliamentarians/en/members?view=ListAll')
soup = BeautifulSoup(page.content, 'lxml')

for ln in soup.select(".personName > a"):
    weblinks.append("https://www.ourcommons.ca" + ln.get('href'))
    if len(weblinks) == 10:
        break

Extracting the emails

for elnk in weblinks:
    pagedet = requests.get(elnk)
    soupdet = BeautifulSoup(pagedet.content, 'lxml')
    for ln1 in soupdet.select(".caucus > a"):
        mat = re.search(r'mailto:\w*\.\w*@parl.gc.ca', ln1.get('href'))
        if mat:
            email.append(mat.group())
        else:
            email.append("No Email Found")

print("Len Email:", len(email))

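Expected result: each page that shows an email address should contribute exactly one entry, and each page without one should contribute a single blank (placeholder) entry.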

1 Answer

User

#1 · Posted on 2024-09-28 15:38:19

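If you inspect the page DOM, you will see that two similar elements are present on some pages, which is why you get multiple values. You need to add a condition that filters out the extra ones. Try the code below.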

import re

import requests
from bs4 import BeautifulSoup

weblinks = []
email = []

# Escape the dots so "." matches only a literal dot in the domain.
email_re = re.compile(r'mailto:\w*\.\w*@parl\.gc\.ca')

page = requests.get('https://www.ourcommons.ca/Parliamentarians/en/members?view=ListAll')
soup = BeautifulSoup(page.content, 'lxml')

# Collect the profile links of the first 10 members.
for ln in soup.select(".personName > a"):
    weblinks.append("https://www.ourcommons.ca" + ln.get('href'))
    if len(weblinks) == 10:
        break

for elnk in weblinks:
    pagedet = requests.get(elnk)
    soupdet = BeautifulSoup(pagedet.content, 'lxml')
    links = soupdet.select(".caucus > a")
    if len(links) > 1:
        # Several anchors: keep only those without a target attribute.
        links = soupdet.select(".caucus > a:not([target])")
    for ln1 in links:
        # Guard against anchors that have no href at all.
        mat = email_re.search(ln1.get('href') or '')
        if mat:
            email.append(mat.group())
        else:
            email.append("No Email Found")

print(email)
print("Len Email:", len(email))

Output:

['mailto:Ziad.Aboultaif@parl.gc.ca', 'mailto:Dan.Albas@parl.gc.ca', 'mailto:harold.albrecht@parl.gc.ca', 'mailto:John.Aldag@parl.gc.ca', 'mailto:Omar.Alghabra@parl.gc.ca', 'mailto:Leona.Alleslev@parl.gc.ca', 'mailto:dean.allison@parl.gc.ca', 'No Email Found', 'No Email Found', 'mailto:Gary.Anand@parl.gc.ca']

Len Email: 10
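
If the goal is strictly one entry per member page, another option is to gather all hrefs found on a page and keep only the first one that matches. The following is a minimal sketch of that idea, not part of the original answer: the names first_email, EMAIL_RE and PLACEHOLDER are hypothetical, the sample hrefs are made up, and the pattern is slightly loosened to [\w.-]+ on the assumption that local parts may contain hyphens or more than one dot.

```python
import re

# Assumed pattern: a mailto link on the parl.gc.ca domain, with escaped dots.
EMAIL_RE = re.compile(r'mailto:[\w.-]+@parl\.gc\.ca')
PLACEHOLDER = "No Email Found"


def first_email(hrefs):
    """Return the first matching mailto link among hrefs, or the placeholder."""
    for href in hrefs:
        if not href:
            # Skip anchors without an href (None or empty string).
            continue
        mat = EMAIL_RE.search(href)
        if mat:
            return mat.group()
    return PLACEHOLDER


# Made-up hrefs standing in for the anchors found on one member page.
print(first_email(["https://example.org/party", "mailto:Jane.Doe@parl.gc.ca"]))
print(first_email(["https://example.org/party", None]))
print(first_email([]))
```

Inside the loop over weblinks, this would be called once per page, for example email.append(first_email(a.get('href') for a in soupdet.select(".caucus > a"))), so the length of email always equals the number of pages visited, however many anchors each page contains.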
