Same CSS selector, different results in the browser and in bs4's select() method

import requests
from bs4 import BeautifulSoup
import lxml

url = 'https://web.archive.org/web/19990421025223/http://www.rbc.ru'
headers = {
    'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.143 Safari/537.36'
}
r = requests.get(url, headers=headers)
soup = BeautifulSoup(r.content, 'lxml')
selector = 'body > table:nth-of-type(2) > tbody:nth-of-type(1)>tr:nth-of-type(1)>td:nth-of-type(5)>table:nth-of-type(1)>tbody:nth-of-type(1)'
print(soup.select(selector=selector))

3 Answers

User

#1 · Edited 2024-06-01 12:23:44

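There are two problems with your code. First, if you want to use CSS selectors in BeautifulSoup, the combinators + > ~ must be separated from their neighbours by spaces (this applies at least to older bs4 releases that predate the Soup Sieve selector engine). If you would rather patch bs4, see here.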
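As a rough illustration of the spacing point, here is a minimal sketch of a helper that puts exactly one space on each side of every combinator. The name normalize_combinators is hypothetical and not part of bs4. It is deliberately naive: it would also touch a + inside something like :nth-child(2n+1) or the ~ in an attribute selector such as [class~=x], so treat it as a sketch for simple selectors like the one in the question.

```python
import re


def normalize_combinators(selector: str) -> str:
    """Hypothetical helper: surround the combinators >, + and ~ with single spaces.

    Naive on purpose: it does not understand :nth-child(2n+1) or [attr~=value].
    """
    return re.sub(r"\s*([>+~])\s*", r" \1 ", selector).strip()


# The tail of the selector from the question has no spaces around ">".
raw = "tbody:nth-of-type(1)>tr:nth-of-type(1)>td:nth-of-type(5)"
fixed = normalize_combinators(raw)
print(fixed)
```

Selectors that are already spaced correctly pass through unchanged, so it is safe to run the helper over a mixed selector like the original one.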

Second, as I said in my earlier answer to your questions, there is no tbody in the page source; the browser generates it.

Here is the fixed CSS selector:

selector = 'body > table:nth-of-type(2) > tr:nth-of-type(1) > td:nth-of-type(5) > table:nth-of-type(1)'
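To see the tbody effect in isolation, here is a small sketch on made-up markup (the RAW_HTML string is an assumption standing in for the real page source, and it uses the standard-library html.parser backend rather than lxml so that it runs with nothing but bs4 installed). The selector that still contains tbody finds nothing, while the same path without tbody finds the cell.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the raw page source: note there is no <tbody>.
RAW_HTML = """
<html><body>
<table><tr><td>first table</td></tr></table>
<table><tr><td>a</td><td>b</td></tr></table>
</body></html>
"""

# html.parser keeps the markup as written and does not insert <tbody>.
soup = BeautifulSoup(RAW_HTML, "html.parser")

# Selector as a browser would generate it (the browser adds <tbody> to its DOM).
with_tbody = "body > table:nth-of-type(2) > tbody:nth-of-type(1) > tr:nth-of-type(1) > td:nth-of-type(2)"
# Same path with the tbody step removed, matching the raw source.
without_tbody = "body > table:nth-of-type(2) > tr:nth-of-type(1) > td:nth-of-type(2)"

matches_with_tbody = soup.select(with_tbody)
matches_without_tbody = soup.select(without_tbody)

print(len(matches_with_tbody), len(matches_without_tbody))
```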

User

#2 · Edited 2024-06-01 12:23:44

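You cannot expect a browser-generated selector to work reliably in BeautifulSoup. The markup changes when the page is rendered in a browser, whereas when you download the page from Python code nothing is rendered and you get only the initial, unrendered HTML.

Here you have to write your own CSS selector or use some other method to locate the table element.

Because the page's markup is not really friendly to HTML parsing, I would locate the table element by one of its column names: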

table = soup.find("b", text="спрос").find_parent("table")

Note that this only worked for me when I parsed the page with the lenient html5lib parser:

soup = BeautifulSoup(response.content, "html5lib")
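A self-contained sketch of the same lookup on made-up markup may help show what find_parent returns. The HTML string, the id attributes and the second column name are assumptions for illustration only. It uses html.parser so it runs without html5lib, and string= in place of text=, which newer bs4 releases prefer.

```python
from bs4 import BeautifulSoup

# Hypothetical nested-table markup; the ids exist only so the result is easy to check.
HTML = """
<table id="outer"><tr><td>
  <table id="quotes">
    <tr><td><b>спрос</b></td><td><b>предложение</b></td></tr>
    <tr><td>1</td><td>2</td></tr>
  </table>
</td></tr></table>
"""

soup = BeautifulSoup(HTML, "html.parser")

# Find the <b> whose text is exactly the column name, then walk up to the
# nearest enclosing <table>. find() returns None when nothing matches.
header = soup.find("b", string="спрос")
table = header.find_parent("table") if header is not None else None

print(table.get("id") if table is not None else "not found")
```

find_parent stops at the closest matching ancestor, so the inner table is returned rather than the layout table wrapping it. The None check matters on the real page, where a different parser may not produce the b element at all.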

User

#3 · Edited 2024-06-01 12:23:44

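Because JavaScript can render the whole page at runtime differently from its source, bs4 is not a good fit for dynamically changing websites.

I would suggest Selenium, since it actually opens the site and lets you wait until certain elements have been rendered before you search for them. If you don't want a browser window popping up, there are also headless browser libraries that emulate a browser environment silently.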
