How do I scrape a JavaScript-driven website with Python?

Method 2: Selenium + BeautifulSoup

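from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from bs4 import BeautifulSoup
import time

tdy_url = "https://www.todayonline.com/"
options = Options()
options.headless = True
driver = webdriver.Chrome("chromedriver", options=options)
driver.get(tdy_url)
time.sleep(10)  # fixed wait so the page's JavaScript has time to run
html = driver.page_source
soup = BeautifulSoup(html)
soup.find_all('h3')  # returns fewer than 1/4 of the 'h3' tags found in the original page source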

Please help. I have scraped other news sites before and it was much easier. Thanks, everyone.

3 Answers

User

Answer 1 · edited 2024-05-05 19:13:59

You can access the data through the API (check the Network tab in your browser's DevTools):

For example,

import requests
url = "https://www.todayonline.com/api/v3/news_feed/7"
data = requests.get(url).json()

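User

Answer 2 · edited 2024-05-05 19:13:59

The news data on the site you are trying to scrape is fetched from the server by JavaScript, through XHR (XMLHttpRequest) calls. This happens dynamically as the page loads or is scrolled, so the data is not part of the page the server returns.

In the first example you only get the page returned by the server, without the news, because the news is supposed to be fetched by JS afterwards. Neither requests nor BeautifulSoup can execute JS.

However, you can use Python requests to replicate the request that fetches the news headlines from the server. Follow these steps: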

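Open your browser's DevTools (usually F12 or Ctrl+Shift+I) and find the request that fetches the news headlines from the server. Sometimes this is even easier than scraping the page with BeautifulSoup.

Copy the request link (right-click -> Copy -> Copy link) and pass it to requests.get(...)
Call .json() on the response. It returns a dict that is easy to work with. To understand the structure of the dict, I recommend pprint rather than plain print. Note that you must run from pprint import pprint before using it.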
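To illustrate the pprint suggestion, here is a minimal sketch that inspects a nested dict at a limited depth. The `sample` dict is a hypothetical stand-in for the result of `requests.get(url).json()`; only the `nodes` / `node` / `title` nesting is taken from this thread, and the `id` field is invented for illustration.

```python
from pprint import pformat, pprint

# Hypothetical stand-in for requests.get(url).json().
# Only the nodes/node/title nesting comes from this thread; "id" is made up.
sample = {
    "nodes": [
        {"node": {"title": "Example headline A", "id": 1}},
        {"node": {"title": "Example headline B", "id": 2}},
    ]
}

# depth=2 collapses anything nested deeper than two levels into "...",
# which gives a quick overview of a large response.
pprint(sample, depth=2)

# pformat returns the same text as a string instead of printing it.
overview = pformat(sample, depth=2)

# Once the outer shape is clear, print a single entry in full.
pprint(sample["nodes"][0])
```

Starting with a small depth and increasing it step by step tends to be easier than reading the whole response at once.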

Here is a code example that gets the titles of the main news items on the page:

import requests


nodes = requests.get("https://www.todayonline.com/api/v3/news_feed/7")\
        .json()["nodes"]
for node in nodes:
    print(node["node"]["title"])
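If some entries might lack a `node` or `title` key, a slightly more defensive variant can help. This is only a sketch: it assumes the response keeps the `{"nodes": [{"node": {"title": ...}}]}` shape used above, and the helper names `extract_titles` and `fetch_titles` are hypothetical.

```python
def extract_titles(payload):
    """Collect titles from a dict shaped like {"nodes": [{"node": {"title": ...}}]}.

    Entries without a "node" or a non-empty "title" are skipped instead of
    raising KeyError.
    """
    titles = []
    for item in payload.get("nodes", []):
        title = item.get("node", {}).get("title")
        if title:
            titles.append(title)
    return titles


def fetch_titles(url, timeout=10):
    """Download the JSON at url and return its titles (hypothetical helper)."""
    # Third-party dependency, imported here so extract_titles works without it.
    import requests

    response = requests.get(url, timeout=timeout)
    response.raise_for_status()  # fail loudly on 4xx/5xx instead of parsing an error page
    return extract_titles(response.json())
```

Calling `fetch_titles("https://www.todayonline.com/api/v3/news_feed/7")` would then return a list of strings, provided the endpoint still responds with that structure.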

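If you want to scrape the group of news items under a particular heading, change the number after news_feed/ in the request URL. To find the right number, filter the requests in DevTools by "news_feed" and scroll down the news page.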
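Changing the number after `news_feed/` can be wrapped in a small loop. The sketch below assumes every feed returns the same `nodes` / `node` / `title` structure as feed 7; the function names are hypothetical, and any feed number other than 7 has to be discovered in DevTools as described above.

```python
BASE_URL = "https://www.todayonline.com/api/v3/news_feed/"


def feed_url(feed_id):
    """Build the request URL for one feed number."""
    return f"{BASE_URL}{feed_id}"


def collect_titles(feed_ids, fetch_json):
    """Return {feed_id: [titles]} for each feed number.

    fetch_json is any callable that maps a URL to the parsed JSON dict,
    which keeps this function independent of the HTTP library used.
    """
    result = {}
    for feed_id in feed_ids:
        payload = fetch_json(feed_url(feed_id))
        result[feed_id] = [
            entry["node"]["title"] for entry in payload.get("nodes", [])
        ]
    return result
```

With requests installed, `collect_titles([7], lambda url: requests.get(url).json())` would fetch the feed used in the example above.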

Sometimes websites are protected against bots (although the site you are trying to scrape is not). In that case you may also need to follow these steps.
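The steps referred to above are not reproduced in this thread. One common measure, shown here purely as an assumption and not as what that answer linked to, is to send browser-like request headers copied from the request shown in DevTools. The helper names and the `Accept` value are illustrative.

```python
def browser_like_headers(user_agent, referer=None):
    """Build request headers that resemble what a browser sends.

    user_agent should be copied from a real browser request in DevTools;
    nothing here is specific to any particular site.
    """
    headers = {
        "User-Agent": user_agent,
        "Accept": "application/json, text/plain, */*",
    }
    if referer:
        headers["Referer"] = referer
    return headers


def get_json(url, headers, timeout=10):
    """Fetch url with the given headers and return the parsed JSON."""
    # Third-party dependency, imported here so the header helper works without it.
    import requests

    response = requests.get(url, headers=headers, timeout=timeout)
    response.raise_for_status()
    return response.json()
```

Whether this is enough depends entirely on the site; some protections also check cookies or require JavaScript, which plain HTTP requests cannot satisfy.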

User

Answer 3 · edited 2024-05-05 19:13:59

I would suggest a fairly simple approach:

import requests
from bs4 import BeautifulSoup as bs

page = requests.get('https://www.todayonline.com/googlenews.xml').content
soup = bs(page)
news = [i.text for i in soup.find_all('news:title')]

print(news)

Output

['DBS named world’s best bank by New York-based financial publication',
 'Russia has very serious questions to answer on Navalny - UK',
 "Exclusive: 90% of China's Sinovac employees, families took coronavirus vaccine - CEO",
 'Three militants killed after fatal attack on policeman in Tunisia',
.....]

You can also look through the XML page itself for more information if you need it.
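To pull more than the titles out of the XML, one option is the standard library's ElementTree with explicit namespaces. This sketch assumes the feed follows the usual Google News sitemap layout (`url` / `loc` / `news:news` / `news:title` / `news:publication_date`), which has not been verified against this site; `SAMPLE` is a made-up document used only to show the shape.

```python
import xml.etree.ElementTree as ET

# Namespace URIs of the sitemap and Google News sitemap schemas.
NS = {
    "sm": "http://www.sitemaps.org/schemas/sitemap/0.9",
    "news": "http://www.google.com/schemas/sitemap-news/0.9",
}

# Made-up example document, not real data from the site.
SAMPLE = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:news="http://www.google.com/schemas/sitemap-news/0.9">
  <url>
    <loc>https://example.com/a</loc>
    <news:news>
      <news:publication_date>2020-09-07</news:publication_date>
      <news:title>Example headline</news:title>
    </news:news>
  </url>
</urlset>"""


def parse_news_sitemap(xml_text):
    """Return one dict per <url> entry with its link, title and publication date."""
    root = ET.fromstring(xml_text)
    entries = []
    for url in root.findall("sm:url", NS):
        entries.append({
            "loc": url.findtext("sm:loc", default="", namespaces=NS),
            "title": url.findtext("news:news/news:title", default="", namespaces=NS),
            "published": url.findtext(
                "news:news/news:publication_date", default="", namespaces=NS
            ),
        })
    return entries


entries = parse_news_sitemap(SAMPLE)
```

For the live feed, the `.content` bytes returned by `requests.get(...)` can be passed to `parse_news_sitemap` in place of `SAMPLE`.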

P.S. Always check compliance before scraping any website :)
