Unable to get fields when web scraping in Python

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.76 Safari/537.36'}  # This is Chrome; you can set whatever browser you like
request = requests.get("https://www.hispanicmeetings.org", verify=False, headers=headers)
soup = bs4.BeautifulSoup(request.content)
soup.find_all("a href")  # this is not getting me company names
soup.find_all('alt')  # this either

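2 Answers

User

Answer 1 · edited 2024-05-03 17:28:19

You are not referencing the right tags and/or attributes with BeautifulSoup. I suggest working through a short HTML tutorial to understand tags and attributes, then looking at how bs4 selects them. From there you can see how to pull out tags and read the text and/or attribute values from them. Try the following code: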

import requests
import bs4

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.76 Safari/537.36'} # This is chrome, you can set whatever browser you like
request = requests.get("https://www.cloudtango.org", verify=False, headers=headers)
soup = bs4.BeautifulSoup(request.content, 'html.parser')
data = soup.find_all('td', {'class':'company'})

for each in data:
    print(each.find('img')['alt'])

Output:

Managed Solution
Redcentric
First Focus
K3 Technology
ICC Managed Services
AffinityMSP
BCA IT, Inc.
CloudCoCo Plc (formerly Adept4 PLC)
SCC
Datacom Systems
Compugen
Cancom
All Covered
Computacenter
q.beyond AG
Atos
Controlware GmbH Firmenzentrale
Trustmarque
Bytes
AHEAD
ACP IT Solutions GmbH
PROFI Engineering Systems AG
PQR
Orbit GmbH
SVA System Vertrieb Alexander GmbH
Ensono
Phoenix Software Ltd
Atea Norge AS
Axians
Kick ICT Group
Atea Sverige AB
Catapult Systems LLC
Valid
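
The loop above assumes every td.company cell contains an img with an alt attribute; if one does not, each.find('img') returns None and the subscript raises an error. Below is a minimal sketch of a more defensive version. It runs on a small made-up HTML string rather than the live site, so the markup, the company_names helper, and the sample rows are assumptions for illustration, not the page's actual structure.

```python
import bs4


def company_names(html):
    """Return the alt text of the first <img> in each td.company cell."""
    soup = bs4.BeautifulSoup(html, "html.parser")
    names = []
    for cell in soup.find_all("td", class_="company"):
        img = cell.find("img")
        # find() returns None when the cell has no <img>;
        # .get() returns None when the alt attribute is missing.
        alt = img.get("alt") if img is not None else None
        if alt:
            names.append(alt.strip())
    return names


# Hypothetical sample markup, invented for this example.
sample = """
<table>
  <tr><td class="company"><img src="a.png" alt="Atos"></td></tr>
  <tr><td class="company">no logo here</td></tr>
  <tr><td class="company"><img src="b.png"></td></tr>
  <tr><td class="country"><img src="flag.png" alt="UK"></td></tr>
  <tr><td class="company"><img src="c.png" alt=" Bytes "></td></tr>
</table>
"""

print(company_names(sample))
```

Cells without a usable alt value are skipped instead of stopping the whole run, which tends to matter on pages where some rows are ads or placeholders.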

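User

Answer 2 · edited 2024-05-03 17:28:19

The company name appears as the alt attribute of the img tag, inside the <td> tag whose class name is company.

You are calling soup.find_all('alt'), but alt is not a tag. A bare string passed to find_all is matched against HTML tag names, not attribute names, so it finds nothing.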
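
To make the tag-versus-attribute distinction concrete, here is a small sketch that runs on an inline HTML string. The markup is made up for illustration and is only assumed to resemble the real page. It shows the two calls from the question returning nothing, and the attribute-filter form of find_all returning the values.

```python
import bs4

# Hypothetical sample markup, invented for this example.
html = """
<table>
  <tr><td class="company"><a href="/first-focus"><img src="a.png" alt="First Focus"></a></td></tr>
  <tr><td class="company"><a href="/redcentric"><img src="b.png" alt="Redcentric"></a></td></tr>
</table>
"""

soup = bs4.BeautifulSoup(html, "html.parser")

# A bare string is compared with tag names, so neither call matches anything.
no_alt_tags = soup.find_all("alt")
no_a_href_tags = soup.find_all("a href")

# Filter on the attribute instead: every <img> that has an alt attribute.
names = [img["alt"] for img in soup.find_all("img", alt=True)]

# Same idea for links: every <a> that has an href attribute.
links = [a["href"] for a in soup.find_all("a", href=True)]

print(no_alt_tags, no_a_href_tags)
print(names)
print(links)
```

Passing alt=True or href=True asks for tags that carry that attribute with any value; the value itself is then read with subscript access on the tag.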

import requests
import bs4

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.76 Safari/537.36'} # This is chrome, you can set whatever browser you like
request = requests.get("https://www.cloudtango.org", verify=False, headers=headers)
soup = bs4.BeautifulSoup(request.content, 'html.parser')

t = soup.find_all('td', class_='company')  # find_all is the current name; findAll is a legacy alias

for i in t:
    print(i.find('img')['alt'])

Output:

First Focus
Managed Solution
centrexIT
Carbon60
Redcentric
BlackPoint IT Services
.
.
.
