Scraping dynamically loaded href attributes with Selenium and BeautifulSoup4

def Scrape_Udemy():
    driver.get('https://couponscorpion.com/marketing/complete-guide-to-pinterest-pinterest-growth-2020/')
    content = driver.page_source
    soup = BeautifulSoup(content, 'html.parser')
    course_link = soup.find_all('div',{'class':"rh_button_wrapper"})
    for i in course_link:
        link = i.find('a',href=True)
        if link is None:
            print('No Links Found')
        print(link['href'])

1 answer

User

#1 · posted 2024-06-01 14:34:49

Two things:

There is a pop-up box with a button that has to be clicked before you grab the page source.
Your link is inside a span, not a div.

Code

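import time
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from bs4 import BeautifulSoup

# Selenium 4 style: the driver path goes through a Service object, and
# find_element(By.XPATH, ...) replaces the removed find_element_by_xpath().
driver = webdriver.Chrome(service=Service(r'c:\users\aaron\chromedriver.exe'))
driver.get('https://couponscorpion.com/marketing/complete-guide-to-pinterest-pinterest-growth-2020/')
time.sleep(5)  # crude wait for the page and its pop-up to load
# Click the button in the pop-up box before reading the page source.
driver.find_element(By.XPATH, '//button[@class="align-right primary slidedown-button"]').click()
content = driver.page_source
soup = BeautifulSoup(content, 'html.parser')
# The links sit inside <span class="rh_button_wrapper">, not a <div>.
course_link = soup.find_all('span',{'class':"rh_button_wrapper"})
for i in course_link:
    link = i.find('a',href=True)
    if link is None:
        print('No Links Found')
        continue  # skip this wrapper; indexing None would raise TypeError
    print(link['href'])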

Output

https://couponscorpion.com/scripts/udemy/out.php?go=Q25aTzVXS1l0TXg1TExNZHE5a3pEUEM4SUxUZlBhWEhZWUwwd2FnS3RIVC96cE5lZEpKREdYcUFMSzZZaGlCM0V6RzF1eUE3aVJNaURZTFp5L0tKeVZ4dmRjOTcxN09WbVlKVXhOOGtIY2M9&s=e89c8d0358244e237e0e18df6b3fe872c1c1cd11&n=1298829005&a=0

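Explanation

Always watch what happens in the browser when driver.get() runs. Sometimes a box has to be clicked before you can get the page source, and the script must perform every browser action a user would.

Here I used an XPath selector to find the element to click in that box: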

//button[@class="align-right primary slidedown-button"]

This means:

// - search the entire DOM, at any depth
button - the HTML tag we want
[@class="..."] - only tags whose class attribute is exactly this string
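To make the exact-match behaviour of [@class="..."] concrete, here is a minimal sketch that runs the same kind of expression against a small, made-up snippet using Python's standard-library ElementTree. The markup and button labels are assumptions for illustration, not taken from the real page. ElementTree supports only a subset of XPath and needs a leading `.` (so `.//` rather than `//`), whereas Selenium hands the expression to the browser's full XPath engine.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup, not copied from the real page.
SAMPLE = """
<html><body>
  <div id="prompt">
    <button class="align-right primary slidedown-button">Allow</button>
    <button class="align-right secondary slidedown-button">Cancel</button>
    <button class="primary align-right slidedown-button">Reordered</button>
  </div>
</body></html>
""".strip()

root = ET.fromstring(SAMPLE)

# ".//" searches every descendant of root; "[@class='...']" requires the
# class attribute to equal this exact string, including the word order.
xpath = ".//button[@class='align-right primary slidedown-button']"
matches = root.findall(xpath)

print([button.text for button in matches])
```

Only the first button matches. The third one carries the same class names in a different order, so an exact comparison on the attribute string skips it.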

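I usually wait a while before accessing elements. This page takes some time to load, and you often need to add a wait before the element or part of the page you want is available.

There are two ways to do this. The code above uses the quick and dirty one, time.sleep() from the time module. Selenium also has dedicated ways to wait for an element to appear; I tried them here but could not get them to work.

For the specific parts worth knowing, see the Selenium documentation on waits.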
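The idea behind waiting for an element, rather than sleeping for a fixed time, is to poll a condition until it holds or a timeout expires. The sketch below shows that pattern in plain Python. `wait_until` and `button_is_present` are hypothetical names made up for this illustration, and the "page" is simulated so the example runs without a browser.

```python
import time


def wait_until(condition, timeout=10.0, poll_interval=0.5):
    """Call condition() repeatedly until it returns a truthy value.

    Returns that value, or raises TimeoutError once `timeout` seconds pass.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(poll_interval)


# Stand-in for a page whose button only shows up on the third check.
attempts = {"count": 0}


def button_is_present():
    attempts["count"] += 1
    return "button" if attempts["count"] >= 3 else None


found = wait_until(button_is_present, timeout=5, poll_interval=0.01)
print(found, attempts["count"])
```

Selenium ships this pattern as `WebDriverWait` together with `expected_conditions`. With `from selenium.webdriver.support.ui import WebDriverWait`, `from selenium.webdriver.support import expected_conditions as EC` and `from selenium.webdriver.common.by import By`, the call would look like `WebDriverWait(driver, 10).until(EC.element_to_be_clickable((By.XPATH, xpath))).click()`. Whether that works on this particular page is untested here.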

If you look at the HTML, you will see that the link sits inside a span element with class rh_button_wrapper, not a div.
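The difference between searching for a div and a span can be reproduced offline. The following sketch parses a small, made-up snippet that mimics the structure described above; the markup, the example.com URL and the helper name `hrefs_in` are assumptions for illustration only.

```python
from bs4 import BeautifulSoup

# Hypothetical markup that mimics the structure described above.
SAMPLE = """
<div class="offer">
  <span class="rh_button_wrapper">
    <a href="https://example.com/out.php?go=abc">Get coupon</a>
  </span>
  <span class="rh_button_wrapper">No anchor here</span>
</div>
"""

soup = BeautifulSoup(SAMPLE, "html.parser")


def hrefs_in(tag_name):
    """Return the href of the first <a> inside each matching wrapper."""
    hrefs = []
    for wrapper in soup.find_all(tag_name, {"class": "rh_button_wrapper"}):
        link = wrapper.find("a", href=True)
        if link is not None:  # skip wrappers that contain no link
            hrefs.append(link["href"])
    return hrefs


print(hrefs_in("div"))   # empty: no div carries the class
print(hrefs_in("span"))  # the wrapper class is on the span
```

Searching for a div returns nothing because the class is on the span, which is the same reason the original code found no links. Checking `link is not None` before indexing also avoids a TypeError on wrappers that contain no anchor.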
