<p>I'm hoping to get some help with a problem I've run into.
I'm fairly new to Python and have been working through Al Sweigart's "Automate the Boring Stuff with Python", trying to simplify some very tedious work.</p>
<p>Here is an overview of the problem:
I'm trying to reach a web page and use the Requests and BeautifulSoup modules to parse the whole site, grab the URLs pointing to the files I need, and then download those files.
The process works great except for one snag: the page has a ReportDropDown option that filters the displayed results. The problem is that even though the page results update with new information, the page's URL does not change, so my requests.get() only retrieves the information behind the default filter.</p>
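<p>For what it's worth, pages like this are typically ASP.NET WebForms: changing the dropdown triggers a POST back to the same URL, carrying hidden fields such as <code>__VIEWSTATE</code> and <code>__EVENTVALIDATION</code>. One Requests-only approach, as a sketch (the exact hidden-field and dropdown names your page requires are assumptions to verify in the browser's network tab), is to scrape those hidden inputs from the default page and POST them back with the new dropdown value:</p>

```python
import bs4

def build_postback_payload(html, option_value):
    """Collect the ASP.NET hidden form fields from a page and set the
    dropdown value, so the filtered page can be fetched with requests.post().

    Note: 'ReportDropDown' and the hidden-field handling are assumptions
    based on a typical WebForms page; check the real form in dev tools.
    """
    soup = bs4.BeautifulSoup(html, 'html.parser')
    payload = {}
    # Every hidden input (e.g. __VIEWSTATE, __EVENTVALIDATION) must be echoed back
    for hidden in soup.select('input[type="hidden"]'):
        name = hidden.get('name')
        if name:
            payload[name] = hidden.get('value', '')
    # Overwrite the dropdown field with the report we actually want
    payload['ReportDropDown'] = option_value
    return payload
```

<p>Then something like <code>requests.post(standardURL, data=payload)</code> should return the HTML for the selected report, which can be parsed with BeautifulSoup exactly as before.</p>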
<p>To work around this, I tried using Selenium to change the report selection. That works too, except that I can't get the Requests module to pull from the Selenium browser instance I opened.</p>
<p>So it seems I can use Requests and BeautifulSoup to get the information behind the page's "default" dropdown filter, and I can use Selenium to change the ReportDropDown option, but I can't combine the two.</p>
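<p>One way to combine them, as a sketch: Requests never needs to talk to the Selenium browser at all. After Selenium changes the dropdown, <code>browser.page_source</code> holds the rendered HTML as a string, and BeautifulSoup can parse that string exactly as it parses <code>res.text</code> from Requests:</p>

```python
import bs4

def extract_pdf_links(page_source):
    """Parse rendered HTML (e.g. Selenium's browser.page_source) just like
    a requests response body, returning the hrefs under the .style97 cells."""
    soup = bs4.BeautifulSoup(page_source, 'html.parser')
    return [a.get('href') for a in soup.select('.style97 a')]

# With a live browser it would look like this (sketch, not run here):
# from selenium import webdriver
# browser = webdriver.Firefox()
# browser.get(standardURL)
# ...change the ReportDropDown selection, wait for the page to update...
# links = extract_pdf_links(browser.page_source)
```

<p>Requests then only enters the picture for the final step, downloading each PDF link that BeautifulSoup found.</p>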
<hr/>
<p>Part 1:</p>
<pre><code>#! python3
import os, requests, bs4

os.chdir('C:\\Standards')
standardURL = 'http://www.nerc.net/standardsreports/standardssummary.aspx'

res = requests.get(standardURL)
res.raise_for_status()
soup = bs4.BeautifulSoup(res.text, 'html.parser')

# this is the url pattern when inspecting the elements on the page
linkElems = soup.select('.style97 a')

# I wanted to save the hyperlinks into a list
splitStandards = []
for link in range(len(linkElems)):
    splitStandards.append(linkElems[link].get('href'))

# Next, I wanted to create the pdf's and copy them locally
print(' STARTING STANDARDS DOWNLOAD '.center(80, '=') + '\n')
for item in splitStandards:
    j = os.path.basename(item)  # BAL-001-2.pdf, etc...
    f = open(j, 'wb')
    ires = requests.get(item)
    # http://www.nerc.com/pa/Stand/Reliability%20Standards/BAL-001-2.pdf
    ires.raise_for_status()
    for chunk in ires.iter_content(1000000):  # 1MB chunks
        f.write(chunk)
    print('Completing download for: ' + str(j) + '.')
    f.close()

print()
print(' STANDARDS DOWNLOAD COMPLETE '.center(80, '='))
</code></pre>
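<p>As a side note on the download loop itself, a slightly safer pattern (a sketch, not required for the fix) is to use <code>with</code> blocks so the file and the connection are closed even if an error occurs mid-download, and <code>stream=True</code> so a large PDF is never held in memory all at once:</p>

```python
import os
import requests

def local_name(url, dest_dir='.'):
    # Derive the on-disk path from the URL, e.g. .../BAL-001-2.pdf -> BAL-001-2.pdf
    return os.path.join(dest_dir, os.path.basename(url))

def download_pdf(url, dest_dir='.'):
    """Stream one PDF to disk in 1 MB chunks and return the local filename."""
    filename = local_name(url, dest_dir)
    with requests.get(url, stream=True) as res:
        res.raise_for_status()
        with open(filename, 'wb') as f:
            for chunk in res.iter_content(1024 * 1024):
                f.write(chunk)
    return filename
```

<p>This behaves like the loop above, but <code>stream=True</code> defers reading the body until <code>iter_content</code> asks for it, and the <code>with</code> blocks guarantee cleanup.</p>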
<p>This pattern works great, except that I can't change the ReportDropDown selection and then use Requests to grab the new page information. I've tinkered with requests.get(), requests.post(url, data={}), selenium-requests, and so on.</p>
<hr/>
<p>Part 2:</p>
<p>Using Selenium seems straightforward enough, but I can't get requests.get() to pull from the correct browser instance. I also had to create a Firefox profile (seleniumDefault) with an about:config change (Windows+R, firefox.exe -p).
Update: the about:config change was temporary: browser.tabs.remote.autostart = true</p>
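<p>If the saved profile exists only to carry that one preference, Selenium can set it programmatically when it builds a throwaway profile, so no manual about:config edit or named seleniumDefault profile is needed. A sketch of that configuration:</p>

```python
from selenium import webdriver

# Build a temporary profile and set the preference in code,
# instead of editing about:config in a saved profile by hand.
fp = webdriver.FirefoxProfile()
fp.set_preference('browser.tabs.remote.autostart', True)
browser = webdriver.Firefox(fp)
```

<p>Since about:config edits made this way live only in the temporary profile, they disappear with the browser session, which also explains why the manual change did not persist.</p>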
^{pr2}$
<p>So, my ultimate question is: how do I select each page and pull the appropriate data from it?</p>
<p>My preference would be to use only the requests and bs4 modules, but if I'm going to use Selenium, how do I get Requests to pull from the Selenium browser instance I opened?</p>
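<p>For completeness, when Requests itself has to act as the Selenium session (rather than just parsing <code>browser.page_source</code>), the usual trick is to copy the browser's cookies into a <code>requests.Session</code>. A sketch follows; note that cookies alone will not replay an ASP.NET dropdown postback, since the hidden form fields still have to be posted:</p>

```python
import requests

def session_from_cookies(selenium_cookies):
    """Build a requests.Session carrying the cookies from a Selenium driver.

    `selenium_cookies` is the list of dicts that driver.get_cookies() returns,
    each with at least 'name' and 'value' keys.
    """
    session = requests.Session()
    for c in selenium_cookies:
        session.cookies.set(c['name'], c['value'],
                            domain=c.get('domain'),
                            path=c.get('path', '/'))
    return session

# usage (sketch):
# session = session_from_cookies(browser.get_cookies())
# res = session.get(standardURL)
```

<p>After this, <code>session.get()</code> and <code>session.post()</code> are recognized by the server as the same session the browser established.</p>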
<p>I've tried to be as thorough as I can, and I'm still fairly new to Python, so any help would be greatly appreciated.
Also, since I'm still learning a lot of this, any beginner-to-intermediate-level explanations would be awesome. Thanks!</p>
<p>=============================================================</p>
<p>Thanks again for the help; it got me past the wall I was stuck at.
Here is the final product. I had to add some sleep statements so that everything loaded completely before grabbing the information.</p>
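<p>A note on those sleep statements: a fixed sleep either wastes time or still races the page. What WebDriverWait does instead is poll a condition until it holds or a timeout expires; in plain Python the idea looks like this (a sketch with made-up timeout defaults):</p>

```python
import time

def wait_until(predicate, timeout=10, poll=0.5):
    """Poll `predicate` until it returns a truthy value or the timeout expires.

    A plain-Python version of what Selenium's WebDriverWait.until() does:
    return the condition's result as soon as it is truthy, otherwise
    raise after `timeout` seconds.
    """
    deadline = time.time() + timeout
    while time.time() < deadline:
        result = predicate()
        if result:
            return result
        time.sleep(poll)
    raise TimeoutError('condition not met within %s seconds' % timeout)
```

<p>In Selenium terms, <code>time.sleep(3)</code> can often be replaced with something like <code>wait.until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, '.style97 a')))</code>, which returns the moment the links exist instead of always waiting the full three seconds.</p>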
<p>Final revised version:</p>
<pre><code>#! python3
# _nercTest.py - Opens the nerc.net website and pulls down all
# pdf's for the present, future, and inactive standards.

import os, requests, bs4, time, datetime
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.select import Select

os.chdir('C:\\Standards')

def nercStandards(standardURL):
    logFile = open('_logFile.txt', 'w')
    logFile.write('Standard\t\tHyperlinks or Errors\t\t' +
                  str(datetime.datetime.now().strftime("%m-%d-%Y %H:%M:%S")) + '\n\n')
    logFile.close()

    fp = webdriver.FirefoxProfile('C:\\pathto\\Firefox\\Profiles\\seleniumDefault')
    browser = webdriver.Firefox(fp)
    wait = WebDriverWait(browser, 10)

    currentOption = 'Mandatory Standards Subject to Enforcement'
    futureOption = 'Standards Subject to Future Enforcement'
    inactiveOption = 'Inactive Reliability Standards'
    dropdownList = [currentOption, futureOption, inactiveOption]

    print()
    print(' STARTING STANDARDS DOWNLOAD '.center(80, '=') + '\n')

    for option in dropdownList:
        standardName = []  # Capture all the standard names accurately
        standardLink = []  # Capture all the href links for each standard
        standardDict = {}  # Combine the standardName and standardLink into a dictionary

        browser.get(standardURL)
        dropdown = Select(browser.find_element_by_id("ReportDropDown"))
        dropdown.select_by_visible_text(option)
        wait.until(EC.text_to_be_present_in_element((By.CSS_SELECTOR, 'div > span[class="style12"]'), option))
        time.sleep(3)  # Needed for the 'inactive' page to completely load consistently

        page_source = browser.page_source
        soup = bs4.BeautifulSoup(page_source, 'html.parser')
        soupElems = soup.select('.style97 a')

        # standardLink list generated here
        for link in range(len(soupElems)):
            standardLink.append(soupElems[link].get('href'))
            # http://www.nerc.com/pa/Stand/Reliability%20Standards/BAL-001-2.pdf

        # standardName list generated here
        if option == currentOption:
            print(' Mandatory Standards Subject to Enforcement '.center(80, '.') + '\n')
            currentElems = soup.select('.style99 span[class="style30"]')
            for currentStandard in range(len(currentElems)):
                standardName.append(currentElems[currentStandard].getText())
                # BAL-001-2
        elif option == futureOption:
            print()
            print(' Standards Subject to Future Enforcement '.center(80, '.') + '\n')
            futureElems = soup.select('.style99 span[class="style30"]')
            for futureStandard in range(len(futureElems)):
                standardName.append(futureElems[futureStandard].getText())
                # COM-001-3
        elif option == inactiveOption:
            print()
            print(' Inactive Reliability Standards '.center(80, '.') + '\n')
            inactiveElems = soup.select('.style104 font[face="Verdana"]')
            for inactiveStandard in range(len(inactiveElems)):
                standardName.append(inactiveElems[inactiveStandard].getText())
                # BAL-001-0

        # if the number of names and links match, then create key:value pairs in standardDict
        if len(standardName) == len(standardLink):
            for x in range(len(standardName)):
                standardDict[standardName[x]] = standardLink[x]
        else:
            print('Error: items in standardName and standardLink are not equal!')
            logFile = open('_logFile.txt', 'a')
            logFile.write('\nError: items in standardName and standardLink are not equal!\n')
            logFile.close()

        # URL correction for PRC-005-1b
        # if 'PRC-005-1b' in standardDict:
        #     standardDict['PRC-005-1b'] = 'http://www.nerc.com/files/PRC-005-1.1b.pdf'

        for k, v in standardDict.items():
            logFile = open('_logFile.txt', 'a')
            f = open(k + '.pdf', 'wb')
            ires = requests.get(v)
            try:
                ires.raise_for_status()
                logFile.write(k + '\t\t' + v + '\n')
            except Exception as exc:
                print('\nThere was a problem on %s: \n%s' % (k, exc))
                logFile.write('There was a problem on %s: \n%s\n' % (k, exc))
            for chunk in ires.iter_content(1000000):
                f.write(chunk)
            f.close()
            logFile.close()
            print(k + ': \n\t' + v)

    print()
    print(' STANDARDS DOWNLOAD COMPLETE '.center(80, '='))

nercStandards('http://www.nerc.net/standardsreports/standardssummary.aspx')
</code></pre>