I'm hoping to get some help with a problem I've run into. I'm fairly new to Python and have been working through Al Sweigart's "Automate the Boring Stuff with Python", trying to streamline some very tedious work.
Here's an overview of the problem: I'm trying to visit a web page and use the Requests and BeautifulSoup modules to parse the whole site, grab the URLs for the files I need, and then download those files. The process works great except for one snag... the page has a ReportDropDown option that filters the displayed results. The problem I'm having is that even though the results update with new information, the page URL does not change, so my requests.get() only fetches the information from the default filter.
So, to deal with this, I tried using Selenium to change the report selection... that works too, except that I can't get the Requests module to pull from the browser instance that Selenium opened.
So it looks like I can use Requests and BeautifulSoup to get the information for the "default" dropdown filter on the page, and I can use Selenium to change the ReportDropDown option, but I can't combine the two.
Part 1:
#! python3

import os, requests, bs4

os.chdir('C:\\Standards')

standardURL = 'http://www.nerc.net/standardsreports/standardssummary.aspx'
res = requests.get(standardURL)
res.raise_for_status()
soup = bs4.BeautifulSoup(res.text, 'html.parser')

# this is the url pattern when inspecting the elements on the page
linkElems = soup.select('.style97 a')

# I wanted to save the hyperlinks into a list
splitStandards = []
for link in range(len(linkElems)):
    splitStandards.append(linkElems[link].get('href'))

# Next, I wanted to create the pdf's and copy them locally
print(' STARTING STANDARDS DOWNLOAD '.center(80, '=') + '\n')
for item in splitStandards:
    j = os.path.basename(item)  # BAL-001-2.pdf, etc...
    f = open(j, 'wb')
    ires = requests.get(item)
    # http://www.nerc.com/pa/Stand/Reliability%20Standards/BAL-001-2.pdf
    ires.raise_for_status()
    for chunk in ires.iter_content(1000000):  # 1MB chunks
        f.write(chunk)
    print('Completing download for: ' + str(j) + '.')
    f.close()

print()
print(' STANDARDS DOWNLOAD COMPLETE '.center(80, '='))
This pattern works great, except that I can't change the ReportDropDown selection and then use Requests to fetch the new page information. I've tinkered with requests.get(), requests.post(url, data={}), selenium-requests, etc...
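For what it's worth, pages like this one are ASP.NET WebForms apps: changing the dropdown triggers a form postback to the same URL rather than a navigation, which is why the URL never changes. A plain requests.post() can sometimes replay that postback if you resend the hidden state fields the server expects. This is only a sketch under assumptions: the hidden field names (__VIEWSTATE, __EVENTVALIDATION, __EVENTTARGET) are the standard ASP.NET ones, and I'm assuming the dropdown's form name is ReportDropDown — check the actual names in your browser's devtools Network tab before relying on this.

```python
from html.parser import HTMLParser

class HiddenInputCollector(HTMLParser):
    """Collects name/value pairs from <input type="hidden"> tags."""
    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != 'input':
            return
        a = dict(attrs)
        if a.get('type') == 'hidden' and 'name' in a:
            self.fields[a['name']] = a.get('value') or ''

def build_postback_payload(page_html, dropdown_value):
    """Build form data that replays an ASP.NET postback: keep the page's
    hidden state fields and add the dropdown choice that triggered it."""
    collector = HiddenInputCollector()
    collector.feed(page_html)
    payload = dict(collector.fields)
    payload['ReportDropDown'] = dropdown_value  # assumed form field name
    # The control that fires the postback goes in __EVENTTARGET
    payload['__EVENTTARGET'] = 'ReportDropDown'
    payload['__EVENTARGUMENT'] = ''
    return payload
```

Usage would be something like: fetch the page once with requests.get(standardURL), then requests.post(standardURL, data=build_postback_payload(res.text, 'Inactive Reliability Standards')). If the server validates event state strictly this can still fail, which is when Selenium becomes the pragmatic choice.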
Part 2:
Using Selenium seemed straightforward enough, but I can't get requests.get() to pull from the correct browser instance. Also, I had to create a Firefox profile (seleniumDefault) containing an about:config change (Windows+R, firefox.exe -p).
Update: the about:config change is temporary: browser.tabs.remote.autostart = true
So, my final question is: how do I make the page selections and then pull the appropriate data for each one?
My preference would be to use only the requests and bs4 modules, but if I'm going to use Selenium, then how do I get Requests to work from the browser instance that Selenium opened?
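One common way to get Requests "out of" an open Selenium browser is to copy the browser's cookies into a requests.Session, so later requests.get() calls carry the same session state the browser built up. A minimal sketch, assuming the relevant state lives in cookies (for an ASP.NET site that is typically the session cookie); session_from_selenium is a hypothetical helper name, not a library function:

```python
import requests

def session_from_selenium(cookie_dicts, user_agent=None):
    """Build a requests.Session that carries the cookies from a Selenium
    browser. cookie_dicts is the list returned by browser.get_cookies(),
    i.e. dicts with at least 'name' and 'value' keys."""
    session = requests.Session()
    for c in cookie_dicts:
        session.cookies.set(
            c['name'], c['value'],
            domain=c.get('domain'),
            path=c.get('path', '/'),
        )
    if user_agent:
        # Matching the browser's User-Agent can matter on picky servers
        session.headers['User-Agent'] = user_agent
    return session
```

With this, after Selenium has changed the dropdown, you could do session = session_from_selenium(browser.get_cookies()) and then session.get(pdf_url) for the downloads. Note this only helps when the server keys the filtered results to the session; for pure postback pages you still need the browser (or a replayed POST) to change the view.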
I've tried to be as thorough as I can, and I'm still fairly new to Python, so any help would be greatly appreciated. Also, since I'm still learning a lot of this, any beginner-to-intermediate explanations would be awesome. Thanks!
=============================================================
Thanks again for the help; it got me past the wall that was blocking me. Here is the final product... I had to add some sleep statements so that everything would fully load before grabbing the information.
Final revised version:
#! python3
# _nercTest.py - Opens the nerc.net website and pulls down all
# pdf's for the present, future, and inactive standards.

import os, requests, bs4, time, datetime

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.select import Select

os.chdir('C:\\Standards')

def nercStandards(standardURL):
    logFile = open('_logFile.txt', 'w')
    logFile.write('Standard\t\tHyperlinks or Errors\t\t' +
                  str(datetime.datetime.now().strftime("%m-%d-%Y %H:%M:%S")) + '\n\n')
    logFile.close()

    fp = webdriver.FirefoxProfile('C:\\pathto\\Firefox\\Profiles\\seleniumDefault')
    browser = webdriver.Firefox(fp)
    wait = WebDriverWait(browser, 10)

    currentOption = 'Mandatory Standards Subject to Enforcement'
    futureOption = 'Standards Subject to Future Enforcement'
    inactiveOption = 'Inactive Reliability Standards'
    dropdownList = [currentOption, futureOption, inactiveOption]

    print()
    print(' STARTING STANDARDS DOWNLOAD '.center(80, '=') + '\n')

    for option in dropdownList:
        standardName = []  # Capture all the standard names accurately
        standardLink = []  # Capture all the href links for each standard
        standardDict = {}  # combine the standardName and standardLink into a dictionary

        browser.get(standardURL)
        dropdown = Select(browser.find_element_by_id("ReportDropDown"))
        dropdown.select_by_visible_text(option)
        wait.until(EC.text_to_be_present_in_element((By.CSS_SELECTOR, 'div > span[class="style12"]'), option))
        time.sleep(3)  # Needed for the 'inactive' page to completely load consistently

        page_source = browser.page_source
        soup = bs4.BeautifulSoup(page_source, 'html.parser')
        soupElems = soup.select('.style97 a')

        # standardLink list generated here
        for link in range(len(soupElems)):
            standardLink.append(soupElems[link].get('href'))
            # http://www.nerc.com/pa/Stand/Reliability%20Standards/BAL-001-2.pdf

        # standardName list generated here
        if option == currentOption:
            print(' Mandatory Standards Subject to Enforcement '.center(80, '.') + '\n')
            currentElems = soup.select('.style99 span[class="style30"]')
            for currentStandard in range(len(currentElems)):
                standardName.append(currentElems[currentStandard].getText())
                # BAL-001-2
        elif option == futureOption:
            print()
            print(' Standards Subject to Future Enforcement '.center(80, '.') + '\n')
            futureElems = soup.select('.style99 span[class="style30"]')
            for futureStandard in range(len(futureElems)):
                standardName.append(futureElems[futureStandard].getText())
                # COM-001-3
        elif option == inactiveOption:
            print()
            print(' Inactive Reliability Standards '.center(80, '.') + '\n')
            inactiveElems = soup.select('.style104 font[face="Verdana"]')
            for inactiveStandard in range(len(inactiveElems)):
                standardName.append(inactiveElems[inactiveStandard].getText())
                # BAL-001-0

        # if number of names and links match, then create key:value pairs in standardDict
        if len(standardName) == len(standardLink):
            for x in range(len(standardName)):
                standardDict[standardName[x]] = standardLink[x]
        else:
            print('Error: items in standardName and standardLink are not equal!')
            logFile = open('_logFile.txt', 'a')
            logFile.write('\nError: items in standardName and standardLink are not equal!\n')
            logFile.close()

        # URL correction for PRC-005-1b
        # if 'PRC-005-1b' in standardDict:
        #     standardDict['PRC-005-1b'] = 'http://www.nerc.com/files/PRC-005-1.1b.pdf'

        for k, v in standardDict.items():
            logFile = open('_logFile.txt', 'a')
            f = open(k + '.pdf', 'wb')
            ires = requests.get(v)
            try:
                ires.raise_for_status()
                logFile.write(k + '\t\t' + v + '\n')
            except Exception as exc:
                print('\nThere was a problem on %s: \n%s' % (k, exc))
                logFile.write('There was a problem on %s: \n%s\n' % (k, exc))
            for chunk in ires.iter_content(1000000):
                f.write(chunk)
            f.close()
            logFile.close()
            print(k + ': \n\t' + v)

    print()
    print(' STANDARDS DOWNLOAD COMPLETE '.center(80, '='))

nercStandards('http://www.nerc.net/standardsreports/standardssummary.aspx')
@HenryM is correct, except that before you read .page_source and pass it to BeautifulSoup for further parsing, you need to make sure the data you want has actually loaded. For that, use WebDriverWait. For example, after you select the "Standards Filed and Pending Regulatory Approval" option, you need to wait for the report header to update — that indicates the new results have loaded. Something like this:
Also note the use of the Select class to operate the dropdown.
Once you've finished the button clicking and so on with Selenium, you need to hand the page source over to BeautifulSoup:
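The Selenium-to-BeautifulSoup handoff comes down to: wait, read browser.page_source, parse. The parsing half can be sketched as a small function using the CSS selectors already used in the question's code ('.style97 a' for links, '.style99 span.style30' for the current/future names); parse_report is a hypothetical helper name:

```python
import bs4

def parse_report(page_html):
    """Parse one report view of the page (e.g. browser.page_source, read
    after the WebDriverWait above has fired) into a {name: pdf-url} dict.
    The selectors are the ones used in the question's code; note that
    dict(zip(...)) silently truncates if the two lists differ in length,
    which is why the full script checks len(names) == len(links) first."""
    soup = bs4.BeautifulSoup(page_html, 'html.parser')
    names = [el.getText() for el in soup.select('.style99 span[class="style30"]')]
    links = [a.get('href') for a in soup.select('.style97 a')]
    return dict(zip(names, links))
```

In the Selenium flow this slots in as: dropdown.select_by_visible_text(option), then wait.until(...), then standardDict = parse_report(browser.page_source).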