Fetching files from a web page with Selenium and Requests in Python 3

Posted 2024-09-24 16:34:18


I'm hoping to get some help with a problem I've run into. I'm fairly new to Python and have been working through Al Sweigart's "Automate the Boring Stuff with Python", trying to streamline some very tedious work.

Here's an overview of the problem: I'm trying to access a web page and use the Requests and BeautifulSoup modules to parse the site, grab the URLs pointing to the files I need, and then download those files. The process works great except for one small issue... the page has a ReportDropDown option that filters the results displayed. The problem is that even though the page results update with new information, the page URL doesn't change, so my requests.get() only pulls the information for the default filter.

So, to get around that, I tried using Selenium to change the report selection... that also works, except I can't feed the Requests module from the Selenium browser instance I have open.

So it looks like I can use Requests and BeautifulSoup to get the information for the "default" page dropdown filter, and I can use Selenium to change the ReportDropDown option, but I can't combine the two.


Part 1:

#! python3
import os, requests, bs4
os.chdir('C:\\Standards')
standardURL = 'http://www.nerc.net/standardsreports/standardssummary.aspx'
res = requests.get(standardURL)
res.raise_for_status()
soup = bs4.BeautifulSoup(res.text, 'html.parser')

# this is the url pattern when inspecting the elements on the page
linkElems = soup.select('.style97 a')

# I wanted to save the hyperlinks into a list
splitStandards = []
for link in range(len(linkElems)):
    splitStandards.append(linkElems[link].get('href'))

# Next, I wanted to create the pdf's and copy them locally
print(' STARTING STANDARDS DOWNLOAD '.center(80, '=') + '\n')
for item in splitStandards:
    j = os.path.basename(item)      # BAL-001-2.pdf, etc...
    f = open(j, 'wb')
    ires = requests.get(item)
    # http://www.nerc.com/pa/Stand/Reliability%20Standards/BAL-001-2.pdf
    ires.raise_for_status()
    for chunk in ires.iter_content(1000000):    # 1MB chunks
        f.write(chunk)
    print('Completing download for: ' + str(j) + '.')
    f.close()
print()
print(' STANDARDS DOWNLOAD COMPLETE '.center(80, '='))

This pattern works great, except that I can't change the ReportDropDown selection and then use Requests to grab the new page information. I've tinkered with requests.get(), requests.post(url, data={}), selenium-requests, etc...
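For context, this is roughly what the requests.post() attempt looked like, assuming the page is a standard ASP.NET WebForms postback (the hidden-field handling is generic ASP.NET; the exact form field name and value for ReportDropDown are assumptions that would have to be read out of the page source):

#! python3
# A sketch of a requests-only postback attempt, assuming a standard ASP.NET form.
# The dropdown's form field name/value below are assumptions, not verified.
import requests, bs4

standardURL = 'http://www.nerc.net/standardsreports/standardssummary.aspx'
session = requests.Session()
res = session.get(standardURL)
res.raise_for_status()
soup = bs4.BeautifulSoup(res.text, 'html.parser')

# ASP.NET pages keep their state in hidden inputs (__VIEWSTATE, __EVENTVALIDATION, ...);
# they have to be echoed back in the POST or the server ignores the change.
payload = {field.get('name'): field.get('value', '')
           for field in soup.select('input[type="hidden"]') if field.get('name')}
payload['ReportDropDown'] = 'Standards Subject to Future Enforcement'   # assumed field name/value

res = session.post(standardURL, data=payload)
res.raise_for_status()
soup = bs4.BeautifulSoup(res.text, 'html.parser')
linkElems = soup.select('.style97 a')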


Part 2:

Using Selenium seemed simple enough, but I couldn't get requests.get() to pull from the correct browser instance. Also, I had to create a Firefox profile (seleniumDefault) with an about:config change (Windows+R, firefox.exe -p). Update: the about:config change was temporary: browser.tabs.remote.autostart = true

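Roughly, the Selenium side of the attempt looked like this (a minimal sketch; the profile path is the same placeholder as below, and the preference can also be set programmatically instead of editing about:config by hand):

#! python3
# A minimal sketch of driving the page with Selenium; the profile path is a placeholder.
from selenium import webdriver

# Load the pre-made profile that already carries the about:config change...
fp = webdriver.FirefoxProfile('C:\\pathto\\Firefox\\Profiles\\seleniumDefault')
# ...or set the (temporary) preference programmatically instead:
# fp.set_preference('browser.tabs.remote.autostart', True)
browser = webdriver.Firefox(fp)
browser.get('http://www.nerc.net/standardsreports/standardssummary.aspx')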

So, my final question is: how do I select each page and pull the appropriate data for each one?

My preference would be to use only the requests and bs4 modules, but if I'm going to use Selenium, then how do I feed Requests from the Selenium browser instance I have open?

I've tried to be as thorough as I can, and I'm still fairly new to Python, so any help would be greatly appreciated. Also, since I'm still learning a lot of this, any beginner-to-intermediate level explanations would be awesome, thanks!

=============================================================

Thanks again for the help, it got me past the wall that was blocking me. Here's the final product... I had to add some sleep statements so that everything would load completely before grabbing the page information.

Final revised version:

#! python3

# _nercTest.py - Opens the nerc.net website and pulls down all
# pdf's for the present, future, and inactive standards.

import os, requests, bs4, time, datetime
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.select import Select

os.chdir('C:\\Standards')

def nercStandards(standardURL):
    logFile = open('_logFile.txt', 'w')
    logFile.write('Standard\t\tHyperlinks or Errors\t\t' +
                  str(datetime.datetime.now().strftime("%m-%d-%Y %H:%M:%S")) + '\n\n')
    logFile.close()
    fp = webdriver.FirefoxProfile('C:\\pathto\\Firefox\\Profiles\\seleniumDefault')
    browser = webdriver.Firefox(fp)
    wait = WebDriverWait(browser, 10)

    currentOption = 'Mandatory Standards Subject to Enforcement'
    futureOption = 'Standards Subject to Future Enforcement'
    inactiveOption = 'Inactive Reliability Standards'

    dropdownList = [currentOption, futureOption, inactiveOption]

    print()
    print(' STARTING STANDARDS DOWNLOAD '.center(80, '=') + '\n')
    for option in dropdownList:
        standardName = []   # Capture all the standard names accurately
        standardLink = []   # Capture all the href links for each standard
        standardDict = {}   # combine the standardName and standardLink into a dictionary 
        browser.get(standardURL)
        dropdown = Select(browser.find_element_by_id("ReportDropDown"))
        dropdown.select_by_visible_text(option)
        wait.until(EC.text_to_be_present_in_element((By.CSS_SELECTOR, 'div > span[class="style12"]'), option))

        time.sleep(3)   # Needed for the 'inactive' page to completely load consistently
        page_source = browser.page_source
        soup = bs4.BeautifulSoup(page_source, 'html.parser')
        soupElems = soup.select('.style97 a')

        # standardLink list generated here
        for link in range(len(soupElems)):
            standardLink.append(soupElems[link].get('href'))
            # http://www.nerc.com/pa/Stand/Reliability%20Standards/BAL-001-2.pdf

        # standardName list generated here
        if option == currentOption:
            print(' Mandatory Standards Subject to Enforcement '.center(80, '.') + '\n')
            currentElems = soup.select('.style99 span[class="style30"]')
            for currentStandard in range(len(currentElems)):
                standardName.append(currentElems[currentStandard].getText())
                # BAL-001-2
        elif option == futureOption:
            print()
            print(' Standards Subject to Future Enforcement '.center(80, '.') + '\n')
            futureElems = soup.select('.style99 span[class="style30"]')
            for futureStandard in range(len(futureElems)):
                standardName.append(futureElems[futureStandard].getText())
                # COM-001-3
        elif option == inactiveOption:
            print()
            print(' Inactive Reliability Standards '.center(80, '.') + '\n')
            inactiveElems = soup.select('.style104 font[face="Verdana"]')
            for inactiveStandard in range(len(inactiveElems)):
                standardName.append(inactiveElems[inactiveStandard].getText())
                # BAL-001-0

        # if number of names and links match, then create key:value pairs in standardDict
        if len(standardName) == len(standardLink):
            for x in range(len(standardName)):
                standardDict[standardName[x]] = standardLink[x]
        else:
            print('Error: items in standardName and standardLink are not equal!')
            logFile = open('_logFile.txt', 'a')
            logFile.write('\nError: items in standardName and standardLink are not equal!\n')
            logFile.close()

        # URL correction for PRC-005-1b
        # if 'PRC-005-1b' in standardDict:
        #     standardDict['PRC-005-1b'] = 'http://www.nerc.com/files/PRC-005-1.1b.pdf'

        for k, v in standardDict.items():
            logFile = open('_logFile.txt', 'a')
            f = open(k + '.pdf', 'wb')
            ires = requests.get(v)
            try:
                ires.raise_for_status()
                logFile.write(k + '\t\t' + v + '\n')
                # only write the pdf once we know the download succeeded
                for chunk in ires.iter_content(1000000):    # 1MB chunks
                    f.write(chunk)
            except Exception as exc:
                print('\nThere was a problem on %s: \n%s' % (k, exc))
                logFile.write('There was a problem on %s: \n%s\n' % (k, exc))
            f.close()
            logFile.close()
            print(k + ': \n\t' + v)
    print()
    print(' STANDARDS DOWNLOAD COMPLETE '.center(80, '='))

nercStandards('http://www.nerc.net/standardsreports/standardssummary.aspx')

2 Answers

@HenryM is correct, except that before you read .page_source and pass it to BeautifulSoup for further parsing, you need to make sure the data you want has actually been loaded. For that, use the WebDriverWait class.

For example, after you select the "Standards Filed and Pending Regulatory Approval" option, you need to wait for the report header to update; that indicates the new results have loaded. Something along these lines:

from bs4 import BeautifulSoup

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.select import Select

# ...

wait = WebDriverWait(browser, 10)

option_text = "Standards Filed and Pending Regulatory Approval" 

# select the dropdown value
dropdown = Select(browser.find_element_by_id("ReportDropDown"))
dropdown.select_by_visible_text(option_text)

# wait for results to be loaded
wait.until(EC.text_to_be_present_in_element((By.CSS_SELECTOR, "#panel5 > div > span"), option_text))

soup = BeautifulSoup(browser.page_source,'html.parser')
# TODO: parse the results

Also note the use of the Select class to manipulate the dropdown.

Once you've used Selenium to click the buttons and finish the work, you need to hand the page over to BeautifulSoup:

    page_source = browser.page_source
    link_soup = bs4.BeautifulSoup(page_source,'html.parser')
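From there, link_soup can be queried exactly like the soup in the requests-only version; for example, reusing the selector from the question:

    # reusing the question's selector to collect the pdf links
    links = [a.get('href') for a in link_soup.select('.style97 a')]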
