<p>I'm hoping to get some help with a problem I've run into.
I'm fairly new to Python and have been working through Al Sweigart's "Automate the Boring Stuff with Python", trying to simplify some very tedious work.</p>
<p>Here is an overview of the problem:
I'm trying to reach a web page and use the Requests and BeautifulSoup modules to parse the whole site, grab the URLs pointing to the files I need, and then download those files.
The process works great except for one snag: the page has a ReportDropDown option that filters the displayed results. The problem is that even though the page results update with new information, the page's URL does not change, so my requests.get() only retrieves the information behind the default filter.</p>
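<p>For what it's worth, pages like this are typically ASP.NET WebForms: changing the dropdown triggers a POST back to the same URL, carrying hidden fields such as <code>__VIEWSTATE</code> and <code>__EVENTVALIDATION</code>. One Requests-only approach, as a sketch (the exact hidden-field and dropdown names your page requires are assumptions to verify in the browser's network tab), is to scrape those hidden inputs from the default page and POST them back with the new dropdown value:</p>

```python
import bs4

def build_postback_payload(html, option_value):
    """Collect the ASP.NET hidden form fields from a page and set the
    dropdown value, so the filtered page can be fetched with requests.post().

    Note: 'ReportDropDown' and the hidden-field handling are assumptions
    based on a typical WebForms page; check the real form in dev tools.
    """
    soup = bs4.BeautifulSoup(html, 'html.parser')
    payload = {}
    # Every hidden input (e.g. __VIEWSTATE, __EVENTVALIDATION) must be echoed back
    for hidden in soup.select('input[type="hidden"]'):
        name = hidden.get('name')
        if name:
            payload[name] = hidden.get('value', '')
    # Overwrite the dropdown field with the report we actually want
    payload['ReportDropDown'] = option_value
    return payload
```

<p>Then something like <code>requests.post(standardURL, data=payload)</code> should return the HTML for the selected report, which can be parsed with BeautifulSoup exactly as before.</p>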
<p>To work around this, I tried using Selenium to change the report selection. That works too, except that I can't get the Requests module to pull from the Selenium browser instance I opened.</p>
<p>So it seems I can use Requests and BeautifulSoup to get the information behind the page's "default" dropdown filter, and I can use Selenium to change the ReportDropDown option, but I can't combine the two.</p>
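<p>One way to combine them, as a sketch: Requests never needs to talk to the Selenium browser at all. After Selenium changes the dropdown, <code>browser.page_source</code> holds the rendered HTML as a string, and BeautifulSoup can parse that string exactly as it parses <code>res.text</code> from Requests:</p>

```python
import bs4

def extract_pdf_links(page_source):
    """Parse rendered HTML (e.g. Selenium's browser.page_source) just like
    a requests response body, returning the hrefs under the .style97 cells."""
    soup = bs4.BeautifulSoup(page_source, 'html.parser')
    return [a.get('href') for a in soup.select('.style97 a')]

# With a live browser it would look like this (sketch, not run here):
# from selenium import webdriver
# browser = webdriver.Firefox()
# browser.get(standardURL)
# ...change the ReportDropDown selection, wait for the page to update...
# links = extract_pdf_links(browser.page_source)
```

<p>Requests then only enters the picture for the final step, downloading each PDF link that BeautifulSoup found.</p>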
<hr/>
<p>Part 1:</p>
<pre><code>#! python3
import os, requests, bs4

os.chdir('C:\\Standards')
standardURL = 'http://www.nerc.net/standardsreports/standardssummary.aspx'

res = requests.get(standardURL)
res.raise_for_status()
soup = bs4.BeautifulSoup(res.text, 'html.parser')

# this is the url pattern when inspecting the elements on the page
linkElems = soup.select('.style97 a')

# I wanted to save the hyperlinks into a list
splitStandards = []
for link in range(len(linkElems)):
    splitStandards.append(linkElems[link].get('href'))

# Next, I wanted to create the pdf's and copy them locally
print(' STARTING STANDARDS DOWNLOAD '.center(80, '=') + '\n')
for item in splitStandards:
    j = os.path.basename(item)  # BAL-001-2.pdf, etc...
    f = open(j, 'wb')
    ires = requests.get(item)
    # http://www.nerc.com/pa/Stand/Reliability%20Standards/BAL-001-2.pdf
    ires.raise_for_status()
    for chunk in ires.iter_content(1000000):  # 1MB chunks
        f.write(chunk)
    print('Completing download for: ' + str(j) + '.')
    f.close()

print()
print(' STANDARDS DOWNLOAD COMPLETE '.center(80, '='))
</code></pre>
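<p>As a side note on the download loop itself, a slightly safer pattern (a sketch, not required for the fix) is to use <code>with</code> blocks so the file and the connection are closed even if an error occurs mid-download, and <code>stream=True</code> so a large PDF is never held in memory all at once:</p>

```python
import os
import requests

def local_name(url, dest_dir='.'):
    # Derive the on-disk path from the URL, e.g. .../BAL-001-2.pdf -> BAL-001-2.pdf
    return os.path.join(dest_dir, os.path.basename(url))

def download_pdf(url, dest_dir='.'):
    """Stream one PDF to disk in 1 MB chunks and return the local filename."""
    filename = local_name(url, dest_dir)
    with requests.get(url, stream=True) as res:
        res.raise_for_status()
        with open(filename, 'wb') as f:
            for chunk in res.iter_content(1024 * 1024):
                f.write(chunk)
    return filename
```

<p>This behaves like the loop above, but <code>stream=True</code> defers reading the body until <code>iter_content</code> asks for it, and the <code>with</code> blocks guarantee cleanup.</p>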
<p>This pattern works great, except that I can't change the ReportDropDown selection and then use Requests to grab the new page information. I've tinkered with requests.get(), requests.post(url, data={}), selenium-requests, and so on.</p>
<hr/>
<p>Part 2:</p>
<p>Using Selenium seems straightforward enough, but I can't get requests.get() to pull from the correct browser instance. I also had to create a Firefox profile (seleniumDefault) with an about:config change (Windows+R, firefox.exe -p).
Update: the about:config change was temporary: browser.tabs.remote.autostart = true</p>
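<p>If the saved profile exists only to carry that one preference, Selenium can set it programmatically when it builds a throwaway profile, so no manual about:config edit or named seleniumDefault profile is needed. A sketch of that configuration:</p>

```python
from selenium import webdriver

# Build a temporary profile and set the preference in code,
# instead of editing about:config in a saved profile by hand.
fp = webdriver.FirefoxProfile()
fp.set_preference('browser.tabs.remote.autostart', True)
browser = webdriver.Firefox(fp)
```

<p>Since about:config edits made this way live only in the temporary profile, they disappear with the browser session, which also explains why the manual change did not persist.</p>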
^{pr2}$
<p>So, my ultimate question is: how do I select each page and pull the appropriate data from it?</p>
<p>My preference would be to use only the requests and bs4 modules, but if I'm going to use Selenium, how do I get Requests to pull from the Selenium browser instance I opened?</p>
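<p>For completeness, when Requests itself has to act as the Selenium session (rather than just parsing <code>browser.page_source</code>), the usual trick is to copy the browser's cookies into a <code>requests.Session</code>. A sketch follows; note that cookies alone will not replay an ASP.NET dropdown postback, since the hidden form fields still have to be posted:</p>

```python
import requests

def session_from_cookies(selenium_cookies):
    """Build a requests.Session carrying the cookies from a Selenium driver.

    `selenium_cookies` is the list of dicts that driver.get_cookies() returns,
    each with at least 'name' and 'value' keys.
    """
    session = requests.Session()
    for c in selenium_cookies:
        session.cookies.set(c['name'], c['value'],
                            domain=c.get('domain'),
                            path=c.get('path', '/'))
    return session

# usage (sketch):
# session = session_from_cookies(browser.get_cookies())
# res = session.get(standardURL)
```

<p>After this, <code>session.get()</code> and <code>session.post()</code> are recognized by the server as the same session the browser established.</p>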
<p>I've tried to be as thorough as I can, and I'm still fairly new to Python, so any help would be greatly appreciated.
Also, since I'm still learning a lot of this, any beginner-to-intermediate-level explanations would be awesome. Thanks!</p>
<p>=============================================================</p>
<p>Thanks again for the help; it got me past the wall I was stuck at.
Here is the final product. I had to add some sleep statements so that everything loaded completely before grabbing the information.</p>
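<p>A note on those sleep statements: a fixed sleep either wastes time or still races the page. What WebDriverWait does instead is poll a condition until it holds or a timeout expires; in plain Python the idea looks like this (a sketch with made-up timeout defaults):</p>

```python
import time

def wait_until(predicate, timeout=10, poll=0.5):
    """Poll `predicate` until it returns a truthy value or the timeout expires.

    A plain-Python version of what Selenium's WebDriverWait.until() does:
    return the condition's result as soon as it is truthy, otherwise
    raise after `timeout` seconds.
    """
    deadline = time.time() + timeout
    while time.time() < deadline:
        result = predicate()
        if result:
            return result
        time.sleep(poll)
    raise TimeoutError('condition not met within %s seconds' % timeout)
```

<p>In Selenium terms, <code>time.sleep(3)</code> can often be replaced with something like <code>wait.until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, '.style97 a')))</code>, which returns the moment the links exist instead of always waiting the full three seconds.</p>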
<p>Final revised version:</p>
<pre><code>#! python3
# _nercTest.py - Opens the nerc.net website and pulls down all
# pdf's for the present, future, and inactive standards.

import os, requests, bs4, time, datetime
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.select import Select

os.chdir('C:\\Standards')

def nercStandards(standardURL):
    logFile = open('_logFile.txt', 'w')
    logFile.write('Standard\t\tHyperlinks or Errors\t\t' +
                  str(datetime.datetime.now().strftime("%m-%d-%Y %H:%M:%S")) + '\n\n')
    logFile.close()

    fp = webdriver.FirefoxProfile('C:\\pathto\\Firefox\\Profiles\\seleniumDefault')
    browser = webdriver.Firefox(fp)
    wait = WebDriverWait(browser, 10)

    currentOption = 'Mandatory Standards Subject to Enforcement'
    futureOption = 'Standards Subject to Future Enforcement'
    inactiveOption = 'Inactive Reliability Standards'
    dropdownList = [currentOption, futureOption, inactiveOption]

    print()
    print(' STARTING STANDARDS DOWNLOAD '.center(80, '=') + '\n')

    for option in dropdownList:
        standardName = []  # Capture all the standard names accurately
        standardLink = []  # Capture all the href links for each standard
        standardDict = {}  # Combine the standardName and standardLink into a dictionary

        browser.get(standardURL)
        dropdown = Select(browser.find_element_by_id("ReportDropDown"))
        dropdown.select_by_visible_text(option)
        wait.until(EC.text_to_be_present_in_element((By.CSS_SELECTOR, 'div > span[class="style12"]'), option))
        time.sleep(3)  # Needed for the 'inactive' page to completely load consistently

        page_source = browser.page_source
        soup = bs4.BeautifulSoup(page_source, 'html.parser')
        soupElems = soup.select('.style97 a')

        # standardLink list generated here
        for link in range(len(soupElems)):
            standardLink.append(soupElems[link].get('href'))
            # http://www.nerc.com/pa/Stand/Reliability%20Standards/BAL-001-2.pdf

        # standardName list generated here
        if option == currentOption:
            print(' Mandatory Standards Subject to Enforcement '.center(80, '.') + '\n')
            currentElems = soup.select('.style99 span[class="style30"]')
            for currentStandard in range(len(currentElems)):
                standardName.append(currentElems[currentStandard].getText())
                # BAL-001-2
        elif option == futureOption:
            print()
            print(' Standards Subject to Future Enforcement '.center(80, '.') + '\n')
            futureElems = soup.select('.style99 span[class="style30"]')
            for futureStandard in range(len(futureElems)):
                standardName.append(futureElems[futureStandard].getText())
                # COM-001-3
        elif option == inactiveOption:
            print()
            print(' Inactive Reliability Standards '.center(80, '.') + '\n')
            inactiveElems = soup.select('.style104 font[face="Verdana"]')
            for inactiveStandard in range(len(inactiveElems)):
                standardName.append(inactiveElems[inactiveStandard].getText())
                # BAL-001-0

        # if the number of names and links match, then create key:value pairs in standardDict
        if len(standardName) == len(standardLink):
            for x in range(len(standardName)):
                standardDict[standardName[x]] = standardLink[x]
        else:
            print('Error: items in standardName and standardLink are not equal!')
            logFile = open('_logFile.txt', 'a')
            logFile.write('\nError: items in standardName and standardLink are not equal!\n')
            logFile.close()

        # URL correction for PRC-005-1b
        # if 'PRC-005-1b' in standardDict:
        #     standardDict['PRC-005-1b'] = 'http://www.nerc.com/files/PRC-005-1.1b.pdf'

        for k, v in standardDict.items():
            logFile = open('_logFile.txt', 'a')
            f = open(k + '.pdf', 'wb')
            ires = requests.get(v)
            try:
                ires.raise_for_status()
                logFile.write(k + '\t\t' + v + '\n')
            except Exception as exc:
                print('\nThere was a problem on %s: \n%s' % (k, exc))
                logFile.write('There was a problem on %s: \n%s\n' % (k, exc))
            for chunk in ires.iter_content(1000000):
                f.write(chunk)
            f.close()
            logFile.close()
            print(k + ': \n\t' + v)

    print()
    print(' STANDARDS DOWNLOAD COMPLETE '.center(80, '='))

nercStandards('http://www.nerc.net/standardsreports/standardssummary.aspx')
</code></pre>