BeautifulSoup does not receive the complete HTML

import requests
from bs4 import BeautifulSoup

url = 'https://pokedex.org/'
html = BeautifulSoup(requests.get(url).content, 'lxml')
uls = html.find('ul', attrs={'id': 'monsters-list'})
print(uls.prettify())

2 Answers

User

#1 · Edited 2024-06-01 09:30:56

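The page is loaded dynamically, so requests alone cannot retrieve the full list: it only downloads the initial HTML and never runs the page's JavaScript. Selenium can be used instead to scrape the page, and the page also has to be scrolled down so that the remaining items are rendered.

Install it with: pip install selenium

Download the ChromeDriver build that matches your Chrome version from here. The code is as follows: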

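from bs4 import BeautifulSoup
from selenium import webdriver
import time

url = 'https://pokedex.org/'
# Use a distinct name so the selenium.webdriver module is not shadowed
driver = webdriver.Chrome()
driver.get(url)
time.sleep(2)  # give the initial JavaScript time to run

# Scroll to the bottom so the rest of the list is rendered
driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
time.sleep(5)
html = BeautifulSoup(driver.page_source, 'lxml')
driver.quit()  # close the browser once the source has been captured

uls = html.find('ul', attrs={'id': 'monsters-list'})

print(uls.prettify())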

The last item of the output:

<li style="background: linear-gradient(90deg, #B8B8D0 50%, #A8B820 50%)">
  <button class="monster-sprite sprite-649" type="button">
  </button>
  <span>
   Genesect
  </span>
 </li>
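
The single scroll followed by a fixed sleep can be generalized into a loop that keeps scrolling until the page height stops growing. The following is only a sketch: `scroll_to_bottom`, `pause`, and `max_rounds` are hypothetical names that do not appear in the answer, and the function assumes `driver` is any object with Selenium's `execute_script` method. Whether this page needs more than one scroll is not something the answer establishes.

```python
import time


def scroll_to_bottom(driver, pause=1.0, max_rounds=20):
    """Scroll until document.body.scrollHeight stops growing.

    Returns the number of scrolls performed (at most max_rounds).
    Hypothetical helper: `driver` only needs an execute_script() method.
    """
    last_height = driver.execute_script("return document.body.scrollHeight")
    for rounds in range(1, max_rounds + 1):
        # Jump to the current bottom of the page
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)  # let newly requested content render
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            return rounds  # nothing new was added, so we are done
        last_height = new_height
    return max_rounds  # gave up: the page was still growing
```

With the driver from the code above, calling `scroll_to_bottom(driver)` before reading `driver.page_source` would take the place of the single `execute_script` call and the five-second sleep. The `max_rounds` cap is there so that an endlessly growing page cannot hang the script.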

User

#2 · Edited 2024-06-01 09:30:56

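It looks like the elements are created by JavaScript, and requests cannot handle elements that JavaScript generates dynamically. (Please correct me if I am wrong.)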
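
One way to check this is to count the `<li>` elements inside the list in whatever HTML you received. The sketch below is an illustration only: `count_list_items` is a hypothetical helper, and it uses the built-in `html.parser` rather than `lxml` so that it has no extra dependency.

```python
from bs4 import BeautifulSoup


def count_list_items(html, list_id="monsters-list"):
    """Return how many <li> elements sit inside <ul id=list_id>.

    Returns None when the <ul> is missing from the HTML altogether.
    Hypothetical helper, not part of requests or BeautifulSoup.
    """
    soup = BeautifulSoup(html, "html.parser")
    ul = soup.find("ul", attrs={"id": list_id})
    if ul is None:
        return None
    return len(ul.find_all("li"))
```

Passing it `requests.get(url).text` and then the browser's `page_source` should show the difference, if the list really is filled in by JavaScript: the second count would be larger than the first.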

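I suggest using selenium with ChromeDriver to get the page source, which can then be parsed with BeautifulSoup.

(Assuming you use the Chrome browser)

Go to chrome://settings/help and check your Chrome version
Download the matching version of ChromeDriver from the official site (https://chromedriver.chromium.org/downloads)
Put chromedriver.exe in the same directory as your Python file

Finally, here is the code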

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

# Run Chrome headless (in the background, without a window)
options = Options()
options.add_argument("--headless")

url = "https://pokedex.org/"
browser = webdriver.Chrome(options=options)
browser.get(url)

# Parse the source rendered by the browser, not a fresh requests.get() response
html = BeautifulSoup(browser.page_source, 'lxml')
browser.quit()  # close the browser once the source has been captured
uls = html.find('ul', attrs={'id': 'monsters-list'})

print(uls.prettify())
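
Once the rendered source is available, the names can be pulled out of the list instead of printing the whole `<ul>`. This is a minimal sketch that assumes each `<li>` holds its name in a `<span>`, as in the sample output shown in the first answer. `monster_names` is a hypothetical helper, and it uses `html.parser` to avoid depending on `lxml`.

```python
from bs4 import BeautifulSoup


def monster_names(page_source, list_id="monsters-list"):
    """Return the text of the first <span> in every <li> of the list.

    Returns an empty list when the <ul> is not present.
    Assumes the <li><button/><span>Name</span></li> layout from the sample output.
    """
    soup = BeautifulSoup(page_source, "html.parser")
    ul = soup.find("ul", attrs={"id": list_id})
    if ul is None:
        return []
    # li.span is the first <span> inside the item; skip items without one
    return [li.span.get_text(strip=True) for li in ul.find_all("li") if li.span]
```

Here `page_source` would be the string returned by the browser object's `page_source` attribute, read before the browser is closed.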
