Unable to get fields when web scraping in Python

Posted 2024-05-03 17:28:19


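I am trying to scrape all the company names (highlighted) from the website below. This is my first web scraping task, so I am struggling to understand why I cannot get the company names even though I seem to be passing the right arguments.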

import requests
import bs4

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.76 Safari/537.36'} # This is chrome, you can set whatever browser you like
request = requests.get("https://www.hispanicmeetings.org", verify=False, headers=headers)
soup = bs4.BeautifulSoup(request.content, 'html.parser')
soup.find_all("a href") # this is not getting me company names
soup.find_all('alt') #this either

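I located the HTML tags on the page and tried many small variations, but nothing seems to work. Any suggestion for collecting all the company names in one place would mean a lot to me.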


2 answers

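You are not referencing the correct tags and/or attributes with BeautifulSoup. I suggest working through a short HTML tutorial to understand tags and attributes, then looking at how to select them with bs4. From there you can see how to pull out tags and extract the text and/or attribute values from them. Try the following code: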

import requests
import bs4

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.76 Safari/537.36'} # This is chrome, you can set whatever browser you like
request = requests.get("https://www.cloudtango.org", verify=False, headers=headers)
soup = bs4.BeautifulSoup(request.content, 'html.parser')
data = soup.find_all('td', {'class':'company'})

for each in data:
    print(each.find('img')['alt'])

Output:

Managed Solution
Redcentric
First Focus
K3 Technology
ICC Managed Services
AffinityMSP
BCA IT, Inc.
CloudCoCo Plc (formerly Adept4 PLC)
SCC
Datacom Systems
Compugen
Cancom
All Covered
Computacenter
q.beyond AG
Atos
Controlware GmbH Firmenzentrale
Trustmarque
Bytes
AHEAD
ACP IT Solutions GmbH
PROFI Engineering Systems AG
PQR
Orbit GmbH
SVA System Vertrieb Alexander GmbH
Ensono
Phoenix Software Ltd
Atea Norge AS
Axians
Kick ICT Group
Atea Sverige AB
Catapult Systems LLC
Valid

Each company name appears as the alt attribute of the <img> tag inside a <td> tag whose class is company.

You are calling soup.find_all('alt'), but alt is not a tag. The first argument of find_all matches HTML tag names, not attribute names.
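To make the tag-versus-attribute distinction concrete, here is a minimal sketch that runs on a small inline HTML string instead of the live site. The markup and the company names in it are made up for illustration and only mimic the structure described above; the real page may differ.

```python
from bs4 import BeautifulSoup

# Hypothetical markup that mimics the structure described in the answers;
# the real page may be laid out differently.
html = """
<table>
  <tr><td class="company"><a href="/acme"><img src="a.png" alt="Acme IT"></a></td></tr>
  <tr><td class="company"><a href="/globex"><img src="g.png" alt="Globex MSP"></a></td></tr>
  <tr><td class="other"><img src="x.png"></td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# "alt" and "a href" are not tag names, so both searches come back empty.
no_alt_tags = soup.find_all("alt")
no_a_href_tags = soup.find_all("a href")

# Search by tag name and filter on the attribute: alt=True keeps only
# <img> tags that actually carry an alt attribute.
names = [img["alt"] for img in soup.find_all("img", alt=True)]

# The same idea for links: <a> tags that carry an href attribute.
links = [a["href"] for a in soup.find_all("a", href=True)]

print(no_alt_tags, no_a_href_tags)
print(names)
print(links)
```

The attribute value is read with dictionary-style access on the tag (img["alt"]) once the tag itself has been found.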

import requests
import bs4

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.76 Safari/537.36'} # This is chrome, you can set whatever browser you like
request = requests.get("https://www.cloudtango.org", verify=False, headers=headers)
soup = bs4.BeautifulSoup(request.content, 'html.parser')

t = soup.find_all('td', class_='company')

for i in t:
    print(i.find('img')['alt'])

Output:

First Focus
Managed Solution
centrexIT
Carbon60
Redcentric
BlackPoint IT Services
.
.
.
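One caveat with both loops above: if a cell with class company contains no <img>, find('img') returns None and indexing it with ['alt'] raises a TypeError. The sketch below shows one possible guard, plus an equivalent CSS selector. It again uses made-up inline HTML, and the helper name company_names is hypothetical.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the middle cell has no <img>, which would break
# an unguarded cell.find('img')['alt'].
html = """
<table>
  <tr><td class="company"><img alt="Acme IT"></td></tr>
  <tr><td class="company">No logo here</td></tr>
  <tr><td class="company"><img alt="Globex MSP"></td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")


def company_names(soup):
    """Collect alt text from <img> tags inside td.company cells, skipping gaps."""
    names = []
    for cell in soup.find_all("td", class_="company"):
        img = cell.find("img")
        # Skip cells without an image or without a non-empty alt attribute.
        if img is not None and img.get("alt"):
            names.append(img["alt"])
    return names


guarded = company_names(soup)

# A CSS selector expresses the same filter in one line:
# <img> tags that have an alt attribute, nested under td.company.
selected = [img["alt"] for img in soup.select("td.company img[alt]")]

print(guarded)
print(selected)
```

Whether the guard is needed depends on the actual page; on markup where every company cell has a logo, the original loops behave the same.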
