Unable to locate and capture several fields in some unstructured HTML

Posted 2024-10-02 20:38:41


I'm trying to use the BeautifulSoup library to dig four fields out of a webpage. The fields are hard to identify individually, which is why I'm asking for help.

Sometimes both emails are present, but that is not always the case. In this example I used indexing to capture the emails, which is surely the worst approach. Moreover, with the attempt below I can only parse the "Email:" label, not the address itself, because the label is a bare text node while the address sits inside the following <a> tag.

What I tried (minimal working example):

from bs4 import BeautifulSoup

html = """
  <p>
   <strong>
    Robert Romanoff
   </strong>
   <br/>
   146 West 29th Street, Suite 11W
   <br/>
   New York, New York 10001
   <br/>
   Telephone: (718) 527-1577
   <br/>
   Fax: (718) 276-8501
   <br/>
   Email:
   <a href="mailto:robert@absol.com">
    robert@absol.com
   </a>
   <br/>
   Additional Contact: William Locantro
   <br/>
   Email:
   <a href="mailto:bill@absol.com">
    bill@absol.com
   </a>
  </p>
"""
soup = BeautifulSoup(html,"lxml")
container = soup.select_one("p")
contact_name = container.strong.text.strip()
contact_email = [i for i in container.strings if "Email" in i][0].strip()
additional_contact = [i.strip() for i in container.strings if "Additional Contact" in i.strip()][0].strip('Additional Contact:')
additional_email = [i for i in container.strings if "Email" in i][1].strip()
print(contact_name, contact_email, additional_contact, additional_email)

Current output:

Robert Romanoff Email: William Locantro Email:

Expected output:

Robert Romanoff robert@absol.com William Locantro bill@absol.com

3 Answers

Here is a solution you can try:

from bs4 import BeautifulSoup
import re

# "html" is the markup shown in the question
soup = BeautifulSoup(html, "lxml")

names_ = [
    soup.select_one("p > strong").text.strip(),
    soup.find(text=re.compile("Additional Contact:")).replace('Additional Contact:', '').strip()
]

email_ = [i.strip() for i in soup.find_all(text=re.compile("absol"))]

print(" ".join(i + " " + j for i, j in zip(names_, email_)))

Robert Romanoff robert@absol.com William Locantro bill@absol.com
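One caveat: `zip` silently truncates to the shorter list, so if a listing has a name with no matching email the pairing drifts. A more defensive sketch (using the stdlib `html.parser` rather than lxml) walks from each "Email" label to the next anchor instead of relying on positions:

```python
import re
from bs4 import BeautifulSoup

# Same markup as in the question, trimmed to the relevant nodes.
html = """
<p>
 <strong>Robert Romanoff</strong><br/>
 Email: <a href="mailto:robert@absol.com">robert@absol.com</a><br/>
 Additional Contact: William Locantro<br/>
 Email: <a href="mailto:bill@absol.com">bill@absol.com</a>
</p>
"""

soup = BeautifulSoup(html, "html.parser")

# Each "Email" text node is immediately followed by the <a> holding the
# address, so find_next("a") pairs them even if a second contact is absent.
emails = [label.find_next("a").get_text(strip=True)
          for label in soup.find_all(string=re.compile("Email"))]
print(emails)
```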

For more complex HTML/XML parsing, you should look at XPath, which allows very powerful selector rules.

In Python it is available in the parsel package:

from parsel import Selector

html = '...'
sel = Selector(html)
name = sel.xpath('//strong[1]/text()').get().strip()
email = sel.xpath("//text()[re:test(., 'Email')]/following-sibling::a/text()").get().strip()
name_additional = sel.xpath("//text()[re:test(., 'Additional Contact')]").re("Additional Contact: (.+)")[0]
email_additional = sel.xpath("//text()[re:test(., 'Additional Contact')]/following-sibling::a/text()").get().strip()
print(name, email, name_additional, email_additional)
# Robert Romanoff robert@absol.com William Locantro bill@absol.com
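If parsel is not installed, the addresses can also be pulled straight from the `mailto:` hrefs with a BeautifulSoup CSS attribute selector; a sketch on a trimmed version of the question's markup, using the stdlib parser:

```python
from bs4 import BeautifulSoup

html = ('<p>Email: <a href="mailto:robert@absol.com">robert@absol.com</a> '
        'Email: <a href="mailto:bill@absol.com">bill@absol.com</a></p>')

soup = BeautifulSoup(html, "html.parser")

# href^="mailto:" matches anchors whose href starts with "mailto:";
# slicing off the scheme prefix leaves the bare address.
emails = [a["href"][len("mailto:"):] for a in soup.select('a[href^="mailto:"]')]
print(emails)  # ['robert@absol.com', 'bill@absol.com']
```

Reading the href rather than the anchor text is slightly more robust, since the visible text can be styled or abbreviated while the href must carry the full address.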

You can do it like this:

  • Select the <div> that contains the required data
  • Build a list of the data inside the selected <div>
  • Iterate over the list and extract the required data

The code:

from bs4 import BeautifulSoup
import requests

url = 'http://www.nyeca.org/find-a-contractor-by-name/'
r = requests.get(url)
soup = BeautifulSoup(r.text,"lxml")

d = soup.find_all('div', class_='sabai-directory-body')
for i in d:
    x = i.text.strip().split('\n')
    data = [x[0].strip()]
    for item in x:
        if item.startswith('Email'):
            data.append(item.split(':')[1].strip())
        elif item.startswith('Additional'):
            data.append(item.split(':')[1].strip())
    print(data)

This prints a list with each contractor's details, plus the additional contact where present:

['Ron Singh', 'rsingh@atechelectric.com']
['George Pacacha', 'Office@agvelectricalservices.com']
['Andrew Drazic', 'ADrazic@atjelectrical.com']
['Albert Barbato', 'Abarbato@abelectriccorp.com']
['Ralph Sica', 'Ralph.Sica@abm.com', 'Henry Kissinger', 'Henry.Kissinger@abm.com']
['Robert Romanoff', 'robert@absoluteelectric.com', 'William Locantro', 'bill@absoluteelectric.com']
...
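When a row carries four values (two contacts), the flat list produced by the loop above can be regrouped into (name, email) pairs with slicing; a stdlib-only sketch:

```python
# Flat row as produced by the loop above: name, email, name, email, ...
row = ['Ralph Sica', 'Ralph.Sica@abm.com',
       'Henry Kissinger', 'Henry.Kissinger@abm.com']

# Pair every even-indexed item (a name) with the odd-indexed item after it
# (its email); zip stops cleanly at the end of the shorter slice.
pairs = list(zip(row[::2], row[1::2]))
print(pairs)
```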
