Handling br tags when web scraping with Python + BeautifulSoup

Published 2024-10-05 13:15:47


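In the example below, I can only extract the content of the span tag, which in this case is the company name (companyOrg). How do I go on to extract the data that appears after the br tags, so that I can also fill in the attributes address, neighborhood, citystatezip, country, phone, fax, email, website, membertype, and so on?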

Here is my code:

import requests
from bs4 import BeautifulSoup
import pandas as pd

url = "https://netforum.avectra.com/eweb/DynamicPage.aspx?Site=IFPUG&WebCode=OrgResult&FromSearchControl=Yes"
results = requests.get(url)
soup = BeautifulSoup(results.text, 'lxml')

# Organization attributes
company = []
adress = []
neighborhood = []
citystatezip = []
country = []
phone = []
fax = []
email = []
website = []
membertype = []

organ_div = soup.find_all('td', class_='PadLeft10')

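for container in organ_div:
    # The company name is the only text inside the <span>; strip the surrounding whitespace
    comp = container.span.get_text(strip=True)
    company.append(comp)

    # How do I retrieve the information below, which follows the br tags?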

    #adress
    #neighborhood
    #citystatezip
    #country
    #phone
    #fax
    #email
    #website
    #membertype

Here is an example of the HTML block contained in organ_div:

<td align="left" class="PadLeft10" style="padding-left: 15px;" valign="top" width="96%">
   <span style="font-weight: bold;">
      <br/>Mw Solucoes Ltda.
   </span>
   <br/>Rua do apolo 45 Ed. Centrale Caixa Postal 07
   <br/>Recife Antigo
   <br/>Recife, Pernambuco 52021568
   <br/>Brazil
   <br/>Phone: 08133304578
   <br/>Fax: 08133304579
   <br/>E-mail: <a href="mailto:mw@mwsolucoes.com.br">mw@mwsolucoes.com.br</a>
   <br/>Web Site: <a href="http://www.mwsolucoes.com.br" target="_blank">http://www.mwsolucoes.com.br</a>
   <br/>Member Type: Regular Corporate
   <br/>
   <a class="DataFormHyperLink" href="javascript:OpenNewWindow('DemographicsShow.aspx?FormKey=d4f4f6c9-a460-49f5-b09d-7e642ce1c9b1&amp;Title='+escape('MW Solucoes')+'&amp;Key=22631F73-11EC-415A-B2E4-A3D1DD42B7BE');" title="Click here for more information">» View More Info</a>
   <br/>
   <br/>
</td>
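
One possible way to get at the text that follows each br tag is to turn every br into a newline and then split the cell's text into lines. The sketch below is only an illustration under a few assumptions: the helper name `parse_org` and the `LABELS` mapping are hypothetical, the sample HTML is a trimmed copy of the block above (the "View More Info" href is shortened), and the unlabeled lines are assumed to come in the order company, address, neighborhood, city/state/zip, country. Records on the real page that omit a line would need extra handling.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the sample block from the question (href shortened for brevity).
SAMPLE_HTML = """
<td align="left" class="PadLeft10" valign="top">
   <span style="font-weight: bold;">
      <br/>Mw Solucoes Ltda.
   </span>
   <br/>Rua do apolo 45 Ed. Centrale Caixa Postal 07
   <br/>Recife Antigo
   <br/>Recife, Pernambuco 52021568
   <br/>Brazil
   <br/>Phone: 08133304578
   <br/>Fax: 08133304579
   <br/>E-mail: <a href="mailto:mw@mwsolucoes.com.br">mw@mwsolucoes.com.br</a>
   <br/>Web Site: <a href="http://www.mwsolucoes.com.br" target="_blank">http://www.mwsolucoes.com.br</a>
   <br/>Member Type: Regular Corporate
   <br/>
   <a class="DataFormHyperLink" href="#" title="Click here for more information">» View More Info</a>
   <br/>
</td>
"""

# Hypothetical mapping from the labels seen in the sample to the question's field names.
LABELS = {
    "Phone": "phone",
    "Fax": "fax",
    "E-mail": "email",
    "Web Site": "website",
    "Member Type": "membertype",
}


def parse_org(td):
    """Return a dict of fields for one <td class="PadLeft10"> cell.

    Note: this modifies the cell in place (removes a link, replaces br tags).
    """
    # Drop the "View More Info" link so its text does not look like a data line.
    for link in td.find_all("a", class_="DataFormHyperLink"):
        link.decompose()
    # Turn each <br/> into a newline so every field ends up on its own line.
    for br in td.find_all("br"):
        br.replace_with("\n")
    lines = [line.strip() for line in td.get_text().splitlines() if line.strip()]

    record = dict.fromkeys(
        ["company", "address", "neighborhood", "citystatezip", "country", *LABELS.values()]
    )
    plain = []  # lines without a known "Label:" prefix
    for line in lines:
        label, sep, value = line.partition(":")
        if sep and label.strip() in LABELS:
            record[LABELS[label.strip()]] = value.strip()
        else:
            plain.append(line)

    # Assumption: unlabeled lines are company, street lines, city/state/zip, country.
    if plain:
        record["company"] = plain[0]
    rest = plain[1:]
    if rest:
        record["country"] = rest[-1]
    if len(rest) >= 2:
        record["citystatezip"] = rest[-2]
    street = rest[:-2]
    if street:
        record["address"] = street[0]
    if len(street) > 1:
        record["neighborhood"] = street[1]
    return record


# html.parser is used here because it keeps a bare <td> fragment as-is.
soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
record = parse_org(soup.find("td", class_="PadLeft10"))
print(record)
```

In the original loop, something like `rows = [parse_org(td) for td in organ_div]` followed by `pd.DataFrame(rows)` would probably replace the ten separate lists, though that has not been checked against the live page.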
