Python Selenium: loop over a list of websites (from a file) to get an element attribute from each site

Published 2024-10-03 19:22:31


How can I use Python Selenium to iterate over a list of websites (from an Excel file) and get a value from each one?

For example, a column in the Excel file contains:

https://www.inc.com/profile/dom-&-tom
https://www.inc.com/profile/decksouth
https://www.inc.com/profile/shp-financial
and many more.....

I want to get a specific href attribute from each link.

My current code:

^{pr2}$

Any input would be greatly appreciated.


3 Answers

To read the Excel file, use the xlrd library. In sheet.cell_value(i, 0), i is the row index and 0 is the column index. Change the column index to match your Excel data.

Define a scraping function with a return value, and append the results to another list if necessary. In your case you are only printing, so I return None.

import xlrd
from selenium import webdriver


def scraping(browser, link):
    # Open the profile page, locate the website anchor and print its href
    browser.get(link)
    website_link_anchor = browser.find_element_by_xpath("//dd[@class='website']/a")
    actual_website_link = website_link_anchor.get_attribute("href")
    print(actual_website_link)
    return None


driver = webdriver.Chrome()

# Give the location of the file
loc = ("path of file")

# To open Workbook
wb = xlrd.open_workbook(loc)
sheet = wb.sheet_by_index(0)
# links = []


for i in range(1, sheet.nrows):
    scraping(driver, sheet.cell_value(i, 0))
    # links.append(sheet.cell_value(i, 0))

driver.close()
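The commented-out links list hints at collecting the values instead of printing them. A minimal sketch of that idea, assuming the scraping function is changed to return the href; collect_hrefs and the scrape callable are illustrative names, not part of the answer's code:

```python
def collect_hrefs(scrape, links):
    # Map each link to whatever the scrape callable extracts from it.
    # scrape would be e.g. functools.partial(scraping, driver) once the
    # function above returns the href instead of printing it.
    results = {}
    for link in links:
        results[link] = scrape(link)
    return results
```

Collecting into a dict keeps each scraped value paired with the link it came from, which makes writing the results back to the spreadsheet straightforward.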

To loop through a list of websites (from an Excel file) and get a value from each one, you need to:

  • Create a list of the websites you want to browse.
  • Then visit each website and locate the desired element.
  • Print the actual website link and loop on to the next one.
  • Always invoke driver.quit() within the tearDown(){} method to close and destroy the WebDriver and Web Client instances gracefully.
  • Your sample code would be:

    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options
    
    myLinks = ['https://www.inc.com/profile/dom-&-tom', 'https://www.inc.com/profile/decksouth', 'https://www.inc.com/profile/shp-financial']
    
    options = Options()
    options.add_argument("start-maximized")
    options.add_argument("disable-infobars")
    options.add_argument("disable-extensions")
    browser = webdriver.Chrome(chrome_options=options, executable_path=r'C:\path\to\chromedriver.exe')  
    for link in myLinks:
        browser.get(link)
        website_link_anchor = browser.find_element_by_xpath("//dd[@class='website']/a")
        actual_website_link = website_link_anchor.get_attribute("href")
        print(actual_website_link)
    browser.quit()
    

Any suggestions to improve my code?

from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.firefox.options import Options
import xlrd
import xlwt
from xlutils.copy import copy

def scraping(browser, link):
    returnValue = ""
    browser.get(link)
    try:
        website_link_anchor = browser.find_element_by_xpath("//dd[@class='website']/a")
        actual_website_link = website_link_anchor.get_attribute("href")
        returnValue = actual_website_link
    except NoSuchElementException: 
        returnValue = "Element not found for: " + link
    return returnValue

options = Options()
options.add_argument("--headless")
browser = webdriver.Firefox(firefox_options=options, executable_path=r'C:\WebDrivers\geckodriver.exe')

file_to_read = r"C:\INC5000\list.xlsx"

# read
file_to_read_wb = xlrd.open_workbook(file_to_read)
file_to_read_wb_sheet = file_to_read_wb.sheet_by_index(0)

# copy and write
file_to_write_to_wb = copy(file_to_read_wb)
file_to_write_to_wb_sheet = file_to_write_to_wb.get_sheet(0)

for i in range(1, file_to_read_wb_sheet.nrows):
    result = scraping(browser, file_to_read_wb_sheet.cell_value(i, 0))
    file_to_write_to_wb_sheet.write(i, 1, result)

file_to_write_to_wb.save(r"C:\INC5000\list2.xls")

browser.close()
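One further hardening idea: profile pages occasionally load slowly or fail transiently, so the per-link call can be wrapped in a small retry helper (a generic sketch; retry is an illustrative name, and Selenium's explicit waits such as WebDriverWait are the more idiomatic fix for late-rendering elements):

```python
import time

def retry(func, attempts=3, delay=1.0, exceptions=(Exception,)):
    # Call func() up to `attempts` times, sleeping `delay` seconds
    # between tries; re-raise the last exception if every attempt fails.
    last = None
    for _ in range(attempts):
        try:
            return func()
        except exceptions as exc:
            last = exc
            time.sleep(delay)
    raise last
```

In the loop above it would be used as result = retry(lambda: scraping(browser, link), attempts=3, delay=2.0).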
