Unable to scrape data that is not visible on screen but is part of a slider/carousel

soup = BeautifulSoup(response, 'html.parser')
divTag = soup.find_all("a", class_=['sc-VigVT', 'eJWBx'])
for tag in divTag:
    tdTags = tag.find_all("h3", class_=['sc-jAaTju', 'iNsSAY'])
    for heading in tdTags:
        print(heading.text)

1 answer

User

#1 · Posted on 2024-05-19 18:18:52

The carousel is generated by JavaScript from JSON data that is hard-coded in the page's JS. More precisely, this JSON is introduced with:

window.__REDUX_STATE__= { ..... }

So, presumably, this site uses Redux to manage the application state.

We can extract this JSON with the following script:

import requests
from bs4 import BeautifulSoup
import json
import pprint

r = requests.get('https://yourstory.com/')

prefix = "window.__REDUX_STATE__="
soup = BeautifulSoup(r.content, "html.parser")

#get the redux state (json)
data = [
    json.loads(t.text[len(prefix):]) 
    for t in soup.find_all('script')
    if "__REDUX_STATE__" in t.text
]

#get only the section with cardType == "CarouselCard"
carouselCards = [
    t["data"]
    for t in data[0]["home"]["sections"]
    if ("cardType" in t) and (t["cardType"] == "CarouselCard")
][0]

#print all cards
pprint.pprint(carouselCards)

#get the title, link path & thumbnail path
print([
    (t["title"], t["path"], t["metadata"]["thumbnail"]) 
    for t in carouselCards
])

The JSON has a sections array in its home field. Some of these section objects have a cardType of CarouselCard, and those contain the data you are looking for.
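If slicing off the prefix turns out to be brittle (extra whitespace around the `=`, or a trailing `;` after the object), here is a minimal sketch of a more tolerant version using only the standard library. The helper names `extract_redux_state` and `carousel_cards` and the `SAMPLE_HTML` page are made up for illustration; only `window.__REDUX_STATE__`, `home`, `sections`, `cardType` and `data` come from the actual page. It also assumes the embedded state is valid JSON, which may not hold on every site.

```python
import json
import re

# Marker that precedes the embedded JSON (taken from the page source above).
STATE_MARKER = re.compile(r"window\.__REDUX_STATE__\s*=\s*")


def extract_redux_state(html):
    """Return the object assigned to window.__REDUX_STATE__, or None if absent."""
    match = STATE_MARKER.search(html)
    if match is None:
        return None
    # raw_decode parses one JSON value starting at the given index and
    # ignores whatever follows it (e.g. a trailing ";" or "</script>").
    state, _end = json.JSONDecoder().raw_decode(html, match.end())
    return state


def carousel_cards(state):
    """Collect the "data" entries of every section whose cardType is CarouselCard."""
    cards = []
    for section in state.get("home", {}).get("sections", []):
        if section.get("cardType") == "CarouselCard":
            cards.extend(section.get("data", []))
    return cards


# Hypothetical, heavily trimmed page used only to exercise the helpers.
SAMPLE_HTML = """
<html><body><script>
window.__REDUX_STATE__ = {"home": {"sections": [
  {"cardType": "ListCard", "data": [{"title": "ignored"}]},
  {"cardType": "CarouselCard", "data": [
    {"title": "Story A", "path": "/story-a", "metadata": {"thumbnail": "a.jpg"}}
  ]}
]}};
</script></body></html>
"""

state = extract_redux_state(SAMPLE_HTML)
for card in carousel_cards(state):
    print(card["title"], card["path"], card["metadata"]["thumbnail"])
```

For the real site you would pass `r.text` from the request instead of `SAMPLE_HTML`.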

Also, in that JSON, the carousel section looks like this:

{
    "type":"content",
    "dataAPI":"/api/v2/featured_stories?brand=yourstory&key=CURATED_SET",
    "dataAttribute":"featured",
    "cardType":"CarouselCard",
    "data":[]
}

So I guess you could also use the API to get the cards: https://yourstory.com/api/v2/featured_stories?brand=yourstory&key=CURATED_SET

import requests

r = requests.get('https://yourstory.com/api/v2/featured_stories?brand=yourstory&key=CURATED_SET')

#get the title, link path & thumbnail path
print([
    (t["title"], t["path"], t["metadata"]["thumbnail"]) 
    for t in r.json()["stories"]
])

which is more straightforward.
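Since each section carries its own relative `dataAPI` path, the API URL can probably be derived from the section instead of being hard-coded. The following is a rough sketch under that assumption. The function names are hypothetical, the `stories` key is taken from the snippet above, and the API itself is undocumented, so its response shape may change. Missing fields are returned as `None` rather than raising `KeyError`.

```python
from urllib.parse import urljoin

BASE_URL = "https://yourstory.com/"


def section_api_url(section, base_url=BASE_URL):
    """Build the absolute API URL from a section's relative "dataAPI" path."""
    return urljoin(base_url, section["dataAPI"])


def story_rows(payload):
    """Return (title, path, thumbnail) tuples; missing fields become None."""
    rows = []
    for story in payload.get("stories", []):
        metadata = story.get("metadata") or {}
        rows.append((story.get("title"), story.get("path"), metadata.get("thumbnail")))
    return rows


def fetch_story_rows(section, timeout=10):
    """Download the section's stories; needs network access and requests."""
    import requests  # imported here so the helpers above work without it

    response = requests.get(section_api_url(section), timeout=timeout)
    response.raise_for_status()  # fail on 4xx/5xx instead of parsing an error page
    return story_rows(response.json())


# The carousel section as it appears in the embedded JSON.
carousel_section = {
    "type": "content",
    "dataAPI": "/api/v2/featured_stories?brand=yourstory&key=CURATED_SET",
    "dataAttribute": "featured",
    "cardType": "CarouselCard",
    "data": [],
}

print(section_api_url(carousel_section))
```

Calling `fetch_story_rows(carousel_section)` would then perform the actual request.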
