Unable to scrape the correct number of videos and images with Selenium in Python

Posted 2024-10-16 11:27:29

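I have searched extensively, but most of the answers did not solve my problem:

  • What I want to scrape:

    • I have a link. The link redirects to a dynamic website
    • I want to get the number of videos and the number of images present at this link
    • I want to do this with bs4, Selenium, and Python
  • The problem I am facing:

    • When I choose "Inspect Element" and do a simple Ctrl+F for the video tag, I can see the correct number of videos. However, when I open "View Page Source" for the same page, I see only one video tag

In addition, when I try to scrape the page, I can retrieve only one video. I don't know why bs4 does not detect the other video tags. I assume this is related to the page being loaded by JavaScript. However, even with the code below, which uses Selenium, I still cannot get the correct number of videos and images.

Here is the code I tried: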

from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Chrome()
driver.get("https://www.kickstarter.com/projects/evolutionwear/fast-solar-charging-that-fits-in-your-pocket/?ref=kicktraq")
res = driver.execute_script('return document.documentElement.outerHTML')
soup = BeautifulSoup(res, 'html.parser')

driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")

c=1
for vidL in soup.find_all("div", {'class': 'play_button_container absolute-center has_played_hide'}):
    print(vidL)
    print(c)
    c+=1
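
One thing that stands out in the attempt above is the ordering: the HTML is captured before the page is scrolled, so anything the site adds while scrolling cannot appear in `soup`. The following is a minimal sketch, assuming the page loads more content as it is scrolled (this is not verified for this site). `scroll_until_stable` and its parameters are hypothetical names, not part of Selenium; the function relies only on `driver.execute_script`, which the code above already uses.

```python
import time


def scroll_until_stable(driver, pause=1.0, max_rounds=20):
    """Scroll to the bottom repeatedly until the page height stops growing.

    `driver` is any object exposing Selenium's `execute_script` method.
    Returns the number of scroll rounds performed.
    """
    last_height = driver.execute_script("return document.body.scrollHeight")
    rounds = 0
    while rounds < max_rounds:
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)  # give lazily loaded content time to arrive
        rounds += 1
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break  # nothing new was appended to the page
        last_height = new_height
    return rounds
```

Calling `scroll_until_stable(driver)` before reading `driver.page_source` means the snapshot handed to BeautifulSoup is taken after the scrolling rather than before it.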

2 Answers

Because the data is rendered by JavaScript, you need to wait for the elements to become visible before using BeautifulSoup.

Code:

from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from bs4 import BeautifulSoup
from selenium import webdriver
driver = webdriver.Chrome()
driver.get("https://www.kickstarter.com/projects/evolutionwear/fast-solar-charging-that-fits-in-your-pocket/?ref=kicktraq")
WebDriverWait(driver,10).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR,".play_button_container")))
res = driver.page_source
soup = BeautifulSoup(res, 'html.parser')
c=1
for vidL in soup.find_all("div", {'class': 'play_button_container absolute-center has_played_hide'}):
    print(vidL)
    print(c)
    c+=1

Output on the console:

<div class="play_button_container absolute-center has_played_hide">
<button aria-label="Play video" class="play_button_big play_button_dark radius2px" type="button">
<span aria-hidden="true" class="ksr-icon__play"></span>
Play
</button>
</div>
1
<div class="play_button_container absolute-center has_played_hide">
<button aria-label="Play video" class="play_button_big play_button_dark radius2px" type="button">
<span aria-hidden="true" class="ksr-icon__play"></span>
Play
</button>
</div>
2
<div class="play_button_container absolute-center has_played_hide">
<button aria-label="Play video" class="play_button_big play_button_dark radius2px" type="button">
<span aria-hidden="true" class="ksr-icon__play"></span>
Play
</button>
</div>
3
<div class="play_button_container absolute-center has_played_hide">
<button aria-label="Play video" class="play_button_big play_button_dark radius2px" type="button">
<span aria-hidden="true" class="ksr-icon__play"></span>
Play
</button>
</div>
4
<div class="play_button_container absolute-center has_played_hide">
<button aria-label="Play video" class="play_button_big play_button_dark radius2px" type="button">
<span aria-hidden="true" class="ksr-icon__play"></span>
Play
</button>
</div>
5
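
The loop above matches on the full class string `'play_button_container absolute-center has_played_hide'`. As far as I know, BeautifulSoup treats a multi-class search string as an exact match on the attribute value, so a div whose class list differs (for example, once `has_played_hide` is dropped) would be skipped. Since the question also asks for images, here is a rough sketch that counts play-button containers by a single class token and also counts `<video>` and `<img>` tags in the rendered HTML. It uses the standard library's `html.parser` instead of bs4 so that it runs without extra packages. `MediaCounter` and `count_media` are hypothetical names, and treating every `<img>` as a countable image is an assumption.

```python
from html.parser import HTMLParser


class MediaCounter(HTMLParser):
    """Count <video> tags, <img> tags, and elements carrying a class token."""

    def __init__(self, class_token="play_button_container"):
        super().__init__()
        self.class_token = class_token
        self.counts = {"video": 0, "img": 0, "play_button": 0}

    def handle_starttag(self, tag, attrs):
        if tag in ("video", "img"):
            self.counts[tag] += 1
        # Match on one class token, regardless of the other classes present.
        classes = (dict(attrs).get("class") or "").split()
        if self.class_token in classes:
            self.counts["play_button"] += 1


def count_media(html, class_token="play_button_container"):
    """Return a dict with the video, img, and play-button counts for `html`."""
    parser = MediaCounter(class_token)
    parser.feed(html)
    parser.close()
    return parser.counts
```

With the Selenium session from the answer, `count_media(driver.page_source)` would return a dict holding the three counts.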

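To print the number of videos, use WebDriverWait with visibility_of_all_elements_located(). You can use either of the following Locator Strategies (the snippets assume a driver instance and the WebDriverWait, EC, and By imports are already in place):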

  • Using CSS_SELECTOR

    driver.get("https://www.kickstarter.com/projects/evolutionwear/fast-solar-charging-that-fits-in-your-pocket/?ref=kicktraq")
    print(len(WebDriverWait(driver, 5).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "div.play_button_container.absolute-center.has_played_hide")))))
    
  • Using XPATH

    driver.get("https://www.kickstarter.com/projects/evolutionwear/fast-solar-charging-that-fits-in-your-pocket/?ref=kicktraq")
    print(len(WebDriverWait(driver, 5).until(EC.visibility_of_all_elements_located((By.XPATH, "//div[@class='play_button_container absolute-center has_played_hide']")))))
    
  • Console output:

    5
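
One caveat, stated as an assumption and not something confirmed for this page: `visibility_of_all_elements_located` returns as soon as every element that currently matches is visible, so elements added to the DOM afterwards would not be counted. A possible safeguard is to poll until the count stops changing. The helper below is a hypothetical sketch (`wait_for_stable_count` is not a Selenium API); it accepts any zero-argument callable that returns the current count.

```python
import time


def wait_for_stable_count(get_count, timeout=10.0, poll=0.5, stable_polls=3):
    """Poll `get_count()` until it returns the same value `stable_polls`
    times in a row, or until `timeout` seconds pass. Returns the last value."""
    deadline = time.monotonic() + timeout
    last = get_count()
    streak = 1  # how many consecutive polls returned `last`
    while streak < stable_polls and time.monotonic() < deadline:
        time.sleep(poll)
        current = get_count()
        streak = streak + 1 if current == last else 1
        last = current
    return last
```

With Selenium it could be called as `wait_for_stable_count(lambda: len(driver.find_elements(By.CSS_SELECTOR, "div.play_button_container")))`.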
    
