Accessing elements that are not visible in the HTML source with Python

Published 2024-06-02 12:01:51


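I am trying to scrape the links of all the episode buttons (EP 212, 211, …) on this page using BeautifulSoup4 and Python 3. This is the code I use to retrieve the page's source: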

from bs4 import BeautifulSoup
import requests as rq

webpage=rq.get('https://gogoanime.pe/category/boruto-naruto-next-generations').text
SourceCode=BeautifulSoup(webpage,'html.parser')
print(SourceCode.prettify())

The problem is that the source I get with this Python code differs from the source I see in the browser with the "Inspect element" option.

In the browser, I see this tag:

<div id="load_ep"> <ul id="episode_related">

with the parent:

<div class="anime_video_body" style="padding: 0 20px 20px 20px;">

It contains all the episode links I want. However, this element does not appear in the output of my Python code, so I cannot access the links.

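I would like to get the same source the browser shows using Beautiful Soup so that I can collect all the links. Please show me how to do this. Thank you very much for your help.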


2 answers

The HTML for these links is generated by JavaScript running in the browser. Specifically, it is the result of the loadListEpisode function defined in the JS file https://cdn.gogocdn.net/files/gogo/js/main.js?v=5.1

In the function definition, the request URL for the HTML containing the links is built as follows:

url: base_url_cdn_api + 'ajax/load-list-episode?ep_start=' + ep_start + '&ep_end=' + ep_end + '&id=' + id + '&default_ep=' + default_ep + '&alias=' + alias
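The query string in that URL can also be assembled with urllib.parse.urlencode instead of manual concatenation, which avoids stray whitespace and escapes the values. This is a minimal sketch: the helper name build_episode_list_url and the example values are hypothetical, and only the parameter names come from the JS snippet above.

```python
from urllib.parse import urlencode


def build_episode_list_url(base_url_cdn_api, ep_start, ep_end, movie_id, default_ep, alias):
    """Assemble the load-list-episode endpoint from values scraped off the page."""
    # Parameter names mirror the JS snippet; dict order is preserved in the query.
    query = urlencode({
        "ep_start": ep_start,
        "ep_end": ep_end,
        "id": movie_id,
        "default_ep": default_ep,
        "alias": alias,
    })
    return f"{base_url_cdn_api}ajax/load-list-episode?{query}"


# Illustrative values only; the real ones come from the category page.
example_url = build_episode_list_url(
    "https://example.invalid/", "0", "100", "1234", "0", "some-alias"
)
print(example_url)
```

With requests, passing the same dictionary as params= to s.get() should give an equivalent result.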

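You can construct that endpoint yourself from values in the HTML page you already have, then parse the links out of the response to a request for it: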

import re
import requests
from bs4 import BeautifulSoup as bs

with requests.Session() as s:
    r = s.get('https://gogoanime.pe/category/boruto-naruto-next-generations')
    soup = bs(r.content, 'lxml')
    # The active pagination entry carries the episode range as attributes
    ep = soup.select_one('.active[ep_start]')
    ep_start = ep['ep_start']
    ep_end = ep['ep_end']
    # The remaining query parameters are stored in elements on the page
    movie_id = soup.select_one('#movie_id')['value']
    alias = soup.select_one('#alias_anime')['value']
    default_ep = soup.select_one('#default_ep')['value']
    # The API base URL is assigned in an inline script
    base_url_cdn_api = re.search(r"base_url_cdn_api = '(.*?)'", r.text).group(1)
    # Adjacent f-strings are concatenated, so no whitespace ends up in the URL
    api_url = (
        f'{base_url_cdn_api}ajax/load-list-episode?ep_start={ep_start}&ep_end={ep_end}'
        f'&id={movie_id}&default_ep={default_ep}&alias={alias}'
    )
    r = s.get(api_url)
    soup = bs(r.content, 'lxml')
    links = ['https://gogoanime.pe' + i['href'].strip() for i in soup.select('a')]
print(links)
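The last step, turning the relative href values into absolute URLs, can also be done with urllib.parse.urljoin from the standard library. This is a small sketch rather than part of the original answer: the helper name absolutize is hypothetical, and the sample hrefs are modeled on the episode paths on this page.

```python
from urllib.parse import urljoin

BASE = "https://gogoanime.pe"  # site root used in the answer above


def absolutize(hrefs, base=BASE):
    """Strip surrounding whitespace from each href and resolve it against the site root."""
    return [urljoin(base, href.strip()) for href in hrefs]


# Sample hrefs with stray whitespace on either side.
sample = [
    " /boruto-naruto-next-generations-episode-212",
    "/boruto-naruto-next-generations-episode-211 ",
]
absolute_links = absolutize(sample)
print(absolute_links)
```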

Try this:

from selenium import webdriver
from bs4 import BeautifulSoup

url = 'https://gogoanime.pe/category/boruto-naruto-next-generations'

# keep simple and download from https://chromedriver.chromium.org/downloads (match version of Chrome installed)
# put file in same folder as the script
driver = webdriver.Chrome()
driver.get(url)

# page_source is the DOM after the browser has run the page's JavaScript
soup = BeautifulSoup(driver.page_source, "html.parser")
driver.quit()
uls = soup.find_all("ul", id="episode_related")

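for element in uls:
    for link in element.find_all('a'):
        # link.text is this link's own label; element.find('a').text would
        # return the first link's label on every iteration
        print(link.text, link['href'])

Output as originally posted. That run printed element.find('a').text, so the first label (EP 212) repeats next to every link:

EP 212
SUB
  /boruto-naruto-next-generations-episode-212
EP 212
SUB
  /boruto-naruto-next-generations-episode-211
EP 212......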
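Because the episode list is filled in by JavaScript after the page loads, reading page_source immediately may sometimes happen before the links exist. One option is an explicit wait, sketched below under some assumptions: the helper name fetch_rendered_html, the selector "#episode_related a", and the 15-second timeout are illustrative choices, not part of the answer above. The Selenium imports sit inside the function so the definition loads even where Selenium is not installed.

```python
def fetch_rendered_html(url, selector="#episode_related a", timeout=15):
    """Return the page HTML once at least one element matching `selector` is present."""
    # Imported here so this module can be loaded without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    driver = webdriver.Chrome()
    try:
        driver.get(url)
        # Blocks until the element appears; raises TimeoutException after `timeout` seconds.
        WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, selector))
        )
        return driver.page_source
    finally:
        # Always close the browser, even if the wait times out.
        driver.quit()
```

The returned string can be passed to BeautifulSoup in the same way as driver.page_source above.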
