Unable to scrape a drop-down menu with BeautifulSoup and Requests

Posted 2024-06-01 11:28:07


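I want to scrape various pieces of information from product pages on the Breitling website.

Example page: https://www.breitling.com/gb-en/watches/navitimer/b01-chronograph-46/AB0127211C1A1/

I am having trouble scraping the watch's strap material, which is shown in the drop-down menu above the "Add to bag" button ('Steel 1.4435' in this example).

The specific element I want is: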

<small class="dd-selected-description dd-desc dd-selected-description-truncated">Steel 1.4435</small>

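However, this element is not returned in the response to the GET request. The closest element to the <small> tag that is returned is a <div> with id='strap-selector-list'.

When I call soup.find(id='strap-selector-list'), though, the <div> comes back with nothing inside it.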

import requests
from bs4 import BeautifulSoup

url = 'https://www.breitling.com/gb-en/watches/navitimer/b01-chronograph-46/AB0127211C1A1/'

r = requests.get(url)
soup = BeautifulSoup(r.text, 'lxml')

soup.find(id='strap-selector-list')

returns

<div id="strap-selector-list"></div>

How can I get the information inside it (the content that shows up when you open the inspector)?

Screenshot of page with inspector open highlighting areas of interest

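What I have tried:

Spoofing the request headers. I copied all the request headers (except the cookies) from the Network tab of the developer tools and used them in the GET request (for brevity, only the changed lines are shown):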

headers = {'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3',
'accept-encoding': 'gzip, deflate, br',
'accept-language': 'en-GB,en-US;q=0.9,en;q=0.8',
'cache-control': 'max-age=0',
'dnt': '1',
'referer': 'https://www.breitling.com/gb-en/watches/navitimer/?search%5Bref%5D=&search%5Bsorting%5D=newest',
'sec-fetch-mode': 'navigate, same-origin, cors',
'sec-fetch-site': 'same-origin',
'sec-fetch-user': '?1',
'upgrade-insecure-requests': '1',
'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.90 Safari/537.36',
'X-Requested-With': 'XMLHttpRequest'
}

r = requests.get(url, headers=headers)

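Checking the XHR requests. There are only 3 when the page loads: one for the status of the checkout, one for retailer information such as store locations, and one for 状态.php (presumably status.php), which returns a 404 error.
Clicking the drop-down menu does not send any XHR request.
Clicking any item in the drop-down menu takes you to that item's product page.
Using a different parser, e.g. html.parser, makes no difference.
Adding the cookies to the headers and making a normal GET request makes no difference either.
Creating session = requests.Session() first and then calling r = session.get(url), both with and without headers=headers, does not work either.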

Any help is much appreciated!


1 answer

Answer 1 · posted 2024-06-01 11:28:07

The data you are looking for is inside a script element.
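One way to find out which script element holds the data is to search the body of every script for a string you can see in the rendered page. The sketch below runs on a small hypothetical stand-in for the page source: the id 'app-reference-versions' and the string 'Steel 1.4435' come from this thread, but the `type` attribute, the JSON layout, and the helper name `scripts_containing` are assumptions made for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the page source; the real markup may differ.
SAMPLE_HTML = """
<html><body>
<div id="strap-selector-list"></div>
<script id="app-reference-versions" type="application/json">
{"versions": [{"description": "Steel 1.4435"}]}
</script>
<script>console.log("unrelated");</script>
</body></html>
"""


def scripts_containing(html, needle):
    """Return the id (or None) of every <script> whose body contains needle."""
    soup = BeautifulSoup(html, "html.parser")
    return [
        script.get("id")
        for script in soup.find_all("script")
        if script.string and needle in script.string
    ]


print(scripts_containing(SAMPLE_HTML, "Steel 1.4435"))
```

On the real page you would pass r.text instead of SAMPLE_HTML. An empty result would suggest the text is fetched by a later request rather than embedded in the HTML.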

All you need to do is load the JSON returned as the body of that script and traverse the resulting dict:

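import json
import pprint

import requests
from bs4 import BeautifulSoup

url = 'https://www.breitling.com/gb-en/watches/navitimer/b01-chronograph-46/AB0127211C1A1/'

r = requests.get(url)
# Name the parser explicitly; a bare 'html' makes BeautifulSoup guess one and warn.
soup = BeautifulSoup(r.text, 'html.parser')

# The JSON payload is the body of the <script> element with this id.
script = soup.find(id='app-reference-versions')
pprint.pprint(json.loads(script.contents[0]))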

Output:

https://pastebin.com/kGhMQt61
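Once the JSON is loaded, traversing the dict can be sketched as a recursive search for a key. The exact shape of the real payload is not reproduced in this thread, so the sample below uses a made-up structure. The key name "description", the "EXAMPLE-…" references, the "Leather" entry, and the helper `find_values` are all assumptions; only the script id and 'Steel 1.4435' come from the question.

```python
import json

from bs4 import BeautifulSoup

# Hypothetical stand-in for the page source; the real JSON layout may differ.
SAMPLE_HTML = """
<div id="strap-selector-list"></div>
<script id="app-reference-versions" type="application/json">
{"versions": [
  {"ref": "EXAMPLE-1", "strap": {"description": "Steel 1.4435"}},
  {"ref": "EXAMPLE-2", "strap": {"description": "Leather"}}
]}
</script>
"""


def find_values(node, key):
    """Yield every value stored under key anywhere in nested dicts and lists."""
    if isinstance(node, dict):
        for k, v in node.items():
            if k == key:
                yield v
            yield from find_values(v, key)
    elif isinstance(node, list):
        for item in node:
            yield from find_values(item, key)


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
script = soup.find(id="app-reference-versions")
data = json.loads(script.string)

descriptions = list(find_values(data, "description"))
print(descriptions)  # ['Steel 1.4435', 'Leather']
```

A recursive search avoids hard-coding the path to the value, which helps when you have not yet inspected the real structure. Once you know the layout, direct indexing is simpler and less likely to match unrelated keys.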
