Selenium (Python) crashes when scraping many pages (50+)

Posted 2024-09-28 01:31:16


I have the script below to scrape TRACE bond data (link in the code). It is a modified version of this: https://github.com/treatmesubj/FINRABondScrape

It runs fine over 50-100 pages, but fails when I try to scrape all ~7,600 pages. The script fails at bond = [tablerow.text] and raises the following error:

StaleElementReferenceException: Message: The element reference is stale; either the element is no longer attached to the DOM, it is not in the current frame context, or the document has been refreshed

I added an explicit wait for tablerows, thinking some tables might take longer to load, but it did not help; the problem persists. I have tried several other approaches and am out of ideas.

Any ideas on how to fix this would be appreciated. Tips for speeding up the code are also welcome. Thanks.

Update: KunduK's suggestion below, plus increasing time.sleep(0.8) in the for loop to time.sleep(1.5), seems to have fixed the problem. I will wait a while before accepting KunduK's answer in case someone comes up with a better one.

# TRACE Bond Scraper
import os
import time
import numpy as np
import pandas as pd
from datetime import date
from datetime import datetime as dt
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import Select
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.firefox.options import Options

options = Options()
options.headless = False
driver = webdriver.Firefox(options = options)
driver.get('http://finra-markets.morningstar.com/BondCenter/Results.jsp')

# Click agree, edit search and submit 
WebDriverWait(driver, 10).until(EC.element_to_be_clickable(
    (By.CSS_SELECTOR, ".button_blue.agree"))).click()
WebDriverWait(driver, 10).until(EC.element_to_be_clickable(
    (By.CSS_SELECTOR, 'a.qs-ui-btn.blue'))).click()
WebDriverWait(driver, 10).until(EC.element_to_be_clickable(
    (By.CSS_SELECTOR, 'a.ms-display-switcher.hide'))).click()
WebDriverWait(driver, 10).until(EC.element_to_be_clickable(
    (By.CSS_SELECTOR, 'input.button_blue[type=submit]'))).click()
WebDriverWait(driver, 10).until(EC.presence_of_element_located(
    (By.CSS_SELECTOR, '.rtq-grid-row.rtq-grid-rzrow .rtq-grid-cell-ctn')))
headers = [title.text for title in driver.find_elements_by_css_selector(
    '.rtq-grid-row.rtq-grid-rzrow .rtq-grid-cell-ctn')[1:]]

# Find out the total number of pages to scrape
pg_no = WebDriverWait(driver, 10).until(EC.presence_of_element_located(
            (By.CSS_SELECTOR, '.qs-pageutil-total > span:nth-child(1)'))).text
pg_no = int(pg_no)

# Scrape tables
bonds = []
for page in range(1, pg_no):
    WebDriverWait(driver, 10).until(EC.presence_of_element_located(
        (By.CSS_SELECTOR, (f"a.qs-pageutil-btn.on[value='{str(page)}']"))))
    time.sleep(0.8)
    tablerows = WebDriverWait(driver, 10).until(EC.presence_of_all_elements_located(
        (By.CSS_SELECTOR, 'div.rtq-grid-bd > div.rtq-grid-row')))
    for tablerow in tablerows:
        bond = [tablerow.text]
        bonds.append(bond)
    WebDriverWait(driver, 10).until(EC.presence_of_element_located(
        (By.CSS_SELECTOR, ('a.qs-pageutil-next')))).click()
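Another fallback I have considered, beyond tuning the sleep, is retrying the row read whenever a row goes stale, re-fetching the elements on each attempt so the retry works on live references. A minimal sketch (the helper name, parameters, and structure are mine, not part of the script above):

```python
import time

try:
    from selenium.common.exceptions import StaleElementReferenceException
except ImportError:  # lets the sketch run without Selenium installed
    class StaleElementReferenceException(Exception):
        pass


def read_rows_with_retry(fetch_rows, attempts=3, delay=0.5):
    """Read .text from every row, re-fetching the rows if one goes stale.

    fetch_rows: zero-argument callable returning fresh row elements,
    e.g. lambda: driver.find_elements_by_css_selector(
        'div.rtq-grid-bd > div.rtq-grid-row')
    """
    for attempt in range(attempts):
        try:
            return [row.text for row in fetch_rows()]
        except StaleElementReferenceException:
            if attempt == attempts - 1:
                raise  # still stale after all attempts: surface the error
            time.sleep(delay)  # give the grid time to finish re-rendering
```

In the loop above, the inner for loop would become something like bonds.extend([[t] for t in read_rows_with_retry(...)]); because the rows are re-fetched inside the helper, a mid-render refresh would trigger a retry instead of killing the whole run.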

1 Answer

Answered 2024-09-28 01:31:16

Change this line from:

WebDriverWait(driver, 10).until(EC.presence_of_element_located(
        (By.CSS_SELECTOR, (f"a.qs-pageutil-btn.on[value='{str(page)}']"))))

to the following, which removes the "on" class:

WebDriverWait(driver, 10).until(EC.presence_of_element_located(
        (By.CSS_SELECTOR, (f"a.qs-pageutil-btn[value='{str(page)}']"))))
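As I understand the fix, waiting on the ".on" variant ties the wait to the moment the paginator button becomes the active one, and the grid may still be re-rendering its rows around that transition; without ".on" the wait only requires the button to exist. The two selectors side by side, as an illustrative helper (the function is mine, not from the answer):

```python
def page_button_selector(page, require_active=False):
    # Selector for the paginator button of a given page number.
    # require_active=True reproduces the original, failure-prone wait
    # (".on" marks the currently active page); False is the suggested fix.
    cls = "a.qs-pageutil-btn.on" if require_active else "a.qs-pageutil-btn"
    return f"{cls}[value='{page}']"
```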
