Web scraping a table with Selenium returns only the HTML elements, not the content

Posted 2024-06-25 05:22:10


I am trying to scrape the tables from these 3 websites with Selenium and BeautifulSoup:

https://www.erstebank.hr/hr/tecajna-lista

https://www.otpbanka.hr/tecajna-lista

https://www.sberbank.hr/tecajna-lista/

For all 3 websites, the result is the HTML of the table, but without the text.

My code is below:

import requests
from bs4 import BeautifulSoup
import pyodbc
import datetime

from selenium import webdriver

PATH = r'C:\Users\xxxxxx\AppData\Local\chromedriver.exe'

driver = webdriver.Chrome(PATH)

driver.get('https://www.erstebank.hr/hr/tecajna-lista')

driver.implicitly_wait(10)

soup = BeautifulSoup(driver.page_source, 'lxml')

table = soup.find_all('table')

print(table)

driver.close()

Please help me with what I am missing.

Thank you all.


3 Answers

BeautifulSoup won't find the table because, from its point of reference, it doesn't exist. Here, you are telling Selenium to pause its own element matcher whenever it notices that an element is not present yet:

# This only works for the Selenium element matcher
driver.implicitly_wait(10)

Then, immediately afterwards, you grab the current HTML state (the table still isn't there) and pass it into BeautifulSoup's parser. BS4 will not be able to see the table, even if it loads in later, because it uses the current HTML code you just gave it:

# You now move the CURRENT STATE OF THE HTML PAGE to BeautifulSoup's parser
soup = BeautifulSoup(driver.page_source, 'lxml')

# As this is now in BS4's hands, it will parse it immediately (won't wait 10 seconds)
table = soup.find_all('table')

# BS4 finds no tables as, when the page first loads, there are none.

To fix this, you can ask Selenium to fetch the HTML table itself. Since Selenium will use the implicitly_wait you specified earlier, it will wait until the table exists before letting the rest of the code proceed. By the time BS4 receives the HTML code, the table will be there:

from selenium.webdriver.common.by import By

driver.implicitly_wait(10)

# Selenium will wait until the element is found
# I used an XPath here, but any other locator that matches the table works
driver.find_element(By.XPATH, "/html/body/div[2]/main/div/section/div[2]/div[1]/div/div/div/div/div/div/div[2]/div[6]/div/div[2]/table/tbody/tr[1]")

soup = BeautifulSoup(driver.page_source, 'lxml')

table = soup.find_all('table')

However, this is overkill. Yes, you can use Selenium to parse the HTML, but you could also use the requests module (which, judging from your code, you have already imported) to get the table data directly.

The data is loaded asynchronously from this endpoint (you can find it yourself with the Chrome DevTools Network tab). You can pair it with the json module to turn it into a nicely formatted dictionary. Not only is this approach faster, it is also far less resource intensive (Selenium has to open a whole browser window).

from requests import get
from json import loads

# Get data from URL
data_as_text = get("https://local.erstebank.hr/rproxy/webdocapi/fx/current").text

# Turn to dictionary
data_dictionary = loads(data_as_text)
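
As a side note, requests can decode JSON by itself, so the separate json import is optional. The exact schema of the payload is defined by the bank's API (and may change), so print it first before relying on specific field names; a minimal sketch:

from requests import get

# .json() decodes the response body directly (equivalent to json.loads on .text)
data_dictionary = get("https://local.erstebank.hr/rproxy/webdocapi/fx/current").json()

# Inspect the structure before assuming particular keys exist
print(data_dictionary)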

The website takes time to load the data into the table.

Either apply time.sleep:

import time

driver.get('https://www.erstebank.hr/hr/tecajna-lista')
time.sleep(10)...

Or apply an explicit wait so that the rows get loaded into the table:

from bs4 import BeautifulSoup

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.wait import WebDriverWait

# Selenium 4 takes the driver path via a Service object
driver = webdriver.Chrome(service=Service("path to chromedriver.exe"))
driver.maximize_window()

driver.get('https://www.erstebank.hr/hr/tecajna-lista')

wait = WebDriverWait(driver,30)
wait.until(EC.presence_of_all_elements_located((By.XPATH,"//table/tbody/tr[@class='ng-scope']")))

# driver.find_element(By.ID, "popin_tc_privacy_button_2").click()  # Cookie consent pop-up. Works fine even without dealing with this pop-up.
soup = BeautifulSoup(driver.page_source, 'html5lib')

table = soup.find_all('table')

print(table)
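
Printing table still shows raw HTML elements. If what you actually want is the cell text (the original complaint), you can walk the parsed rows; a minimal sketch using the soup from above:

# Print the stripped text of each row of the first matched table,
# skipping empty spacer rows
for row in table[0].find_all('tr'):
    cells = [cell.get_text(strip=True) for cell in row.find_all(['th', 'td'])]
    if cells:
        print(cells)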
You could use this as a basis for further work:

from bs4 import BeautifulSoup as BS
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC

TDCLASS = 'ng-binding'

options = webdriver.ChromeOptions()
options.add_argument('--headless')
with webdriver.Chrome(options=options) as driver:
    driver.get('https://www.erstebank.hr/hr/tecajna-lista')
    try:
        # There may be a cookie request dialogue which we need to click through
        WebDriverWait(driver, 5).until(EC.presence_of_element_located(
            (By.ID, 'popin_tc_privacy_button_2'))).click()
    except Exception:
        pass  # Probably timed out so ignore on the basis that the dialogue wasn't presented
    # The relevant <td> elements all seem to be of class 'ng-binding' so look for those
    WebDriverWait(driver, 5).until(
        EC.presence_of_element_located((By.CLASS_NAME, TDCLASS)))
    soup = BS(driver.page_source, 'lxml')
    for td in soup.find_all('td', class_=TDCLASS):
        print(td)
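
If pandas is available (an extra dependency, not part of the answer above), it can lift the rendered tables straight into DataFrames instead of walking <td> elements by hand; a sketch to run while the driver is still open inside the with block:

import pandas as pd

# pandas parses every <table> in the rendered HTML into a DataFrame;
# driver.page_source must be read while the session is still alive
tables = pd.read_html(driver.page_source)
for df in tables:
    print(df.head())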
