Scraping a website with Python/BeautifulSoup: why isn't this table returned?

2 answers

User

Answer 1 · edited 2024-09-29 19:29:32

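If you inspect the page source that comes back from a plain request, you can see that the site is dynamic: the table is built by JavaScript after the page loads. It is therefore better to use a browser automation tool such as selenium: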

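from bs4 import BeautifulSoup
from selenium import webdriver
import time

driver = webdriver.Chrome()
driver.get('http://www.worldlifeexpectancy.com/cause-of-death/alzheimers-dementia/by-country/')
time.sleep(3)  # crude pause so the JavaScript can build the table; the 3 seconds is a guess
page = BeautifulSoup(driver.page_source, 'lxml')
driver.quit()
# Keep the non-empty country names from the cells with class "hc_name"
countries = [td.text for td in page.find_all('td', {'class': 'hc_name'}) if td.text]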

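As a rough illustration of the source check mentioned at the top of this answer, the sketch below looks for the `hc_name` cells in a raw HTML string. Both HTML snippets are made-up stand-ins for "source as downloaded" and "source after JavaScript has run", not copies of the real page, and `has_country_cells` is a hypothetical helper name. It uses Python's built-in `html.parser` backend so that lxml is not needed.

```python
from bs4 import BeautifulSoup


def has_country_cells(html, css_class="hc_name"):
    """Return True if the HTML contains at least one <td> with the given class."""
    page = BeautifulSoup(html, "html.parser")
    return page.find("td", class_=css_class) is not None


# Made-up raw source: the table is empty until JavaScript fills it.
raw_source = '<table id="data"></table><script src="load.js"></script>'

# Made-up rendered source, e.g. what driver.page_source might hold later.
rendered_source = (
    '<table id="data"><tr><td class="hc_name">Country A</td></tr></table>'
)

print(has_country_cells(raw_source))       # False
print(has_country_cells(rendered_source))  # True
```

If the same check on the text returned by a plain request comes back False while the browser shows the table, that is a strong hint the data arrives later through JavaScript.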

User

Answer 2 · edited 2024-09-29 19:29:32

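The table is not present in the page source; it is loaded dynamically through an AJAX request. If you look at the Network tab in the browser's Developer Tools, you can see that the AJAX request is sent to this URL: http://www.worldlifeexpectancy.com/j/country-cause?cause=95&order=hight

You can see that the data is served as JSON. With the response object's built-in .json() method, you can fetch this data using only the requests module.

From this JSON you can get all of the data, such as rank, country, and rate.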

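import requests

# Call the endpoint that the page itself requests via AJAX
r = requests.get('http://www.worldlifeexpectancy.com/j/country-cause?cause=95&order=hight')
data = r.json()  # decode the JSON body into Python dicts and lists

# Each entry under chart -> countries -> countryitem describes one country
for row in data['chart']['countries']['countryitem']:
    id_ = row['id']
    country = row['name']
    rank = row['rank']
    value = row['value']
    print(rank, id_, country, value)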

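If you would rather collect the rows into a list than print them, something like the following might work. The sample payload is invented and only mirrors the key names used in the loop above (`chart`, `countries`, `countryitem`, `id`, `name`, `rank`, `value`); the real response may carry more fields or different types, and `extract_rows` is a hypothetical helper. The sketch also assumes that `rank` and `value` can be converted to numbers.

```python
import json

# Invented sample that only imitates the shape of the real response.
SAMPLE = """
{"chart": {"countries": {"countryitem": [
  {"id": "x2", "name": "Country B", "rank": "2", "value": "30.5"},
  {"id": "x1", "name": "Country A", "rank": "1", "value": "54.9"}
]}}}
"""


def extract_rows(data):
    """Return (rank, id, name, value) tuples sorted by rank."""
    # .get() with defaults yields an empty list if a level is missing.
    items = data.get("chart", {}).get("countries", {}).get("countryitem", [])
    rows = [
        (int(item["rank"]), item["id"], item["name"], float(item["value"]))
        for item in items
    ]
    return sorted(rows)


rows = extract_rows(json.loads(SAMPLE))
for rank, id_, name, value in rows:
    print(rank, id_, name, value)
```

With a live response you would call `extract_rows(r.json())` instead of parsing the sample string.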

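Also keep in mind that the <tbody> element is often absent from the page source: when the author did not write it, the browser inserts it while building the DOM. So when scraping a table, don't rely on tbody in find(). See "Why do browsers insert tbody element into table elements?".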
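A small sketch of that point, using an invented table and the built-in `html.parser` backend, which keeps the markup as written instead of adding tbody the way a browser does. The table id and cell values are made up for illustration.

```python
from bs4 import BeautifulSoup

# Invented markup: like much hand-written HTML, it has no <tbody>.
html = """
<table id="stats">
  <tr><td>1</td><td>Country A</td></tr>
  <tr><td>2</td><td>Country B</td></tr>
</table>
"""

page = BeautifulSoup(html, "html.parser")
table = page.find("table", id="stats")

# Looking for tbody finds nothing, because the source never contained it.
print(table.find("tbody"))  # None

# Going straight from the table to its rows works whether or not tbody exists.
rows = [[td.get_text(strip=True) for td in tr.find_all("td")]
        for tr in table.find_all("tr")]
print(rows)  # [['1', 'Country A'], ['2', 'Country B']]
```

Because `find_all("tr")` searches all descendants, the same row lookup keeps working on pages whose source does include tbody.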
