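I'm trying to download tables from the Humane Society Legislative Fund website. The following code successfully retrieves the data from one of the pages: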
import time
import pandas as pd
from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager
browser = webdriver.Chrome(ChromeDriverManager().install())
browser.get('https://hslf.org/scorecards/2007-senate-midterm')
time.sleep(10)
html = browser.page_source
humane_sc_tables = pd.read_html(html)
humane_sc_data = humane_sc_tables[0]
Now I need to loop over multiple URLs and export the result from each page to a CSV file:
import time
import pandas as pd
from selenium import webdriver
from selenium.common import exceptions
from webdriver_manager.chrome import ChromeDriverManager
# browser = webdriver.Chrome(ChromeDriverManager().install())
URL_list = ["https://hslf.org/scorecards/2007-senate-midterm",
"https://hslf.org/scorecards/2008-senate-final",
"https://hslf.org/scorecards/2008-house-final",
"https://hslf.org/scorecards/2009-senate-midterm",
"https://hslf.org/scorecards/2009-house-midterm",
"https://hslf.org/scorecards/2010-house-final",
"https://hslf.org/scorecards/2010-senate-final",
"https://hslf.org/scorecards/2011-house-midterm",
"https://hslf.org/scorecards/2011-senate-midterm",
"https://hslf.org/scorecards/2012-house-final",
"https://hslf.org/scorecards/2012-senate-final",
"https://hslf.org/scorecards/2013-house-midterm",
"https://hslf.org/scorecards/2013-senate-midterm",
"https://hslf.org/scorecards/2014-house-final",
"https://hslf.org/scorecards/2014-senate-final",
"https://hslf.org/scorecards/2015-house-midterm",
"https://hslf.org/scorecards/2015-senate-midterm",
"https://hslf.org/scorecards/2016-house-final",
"https://hslf.org/scorecards/2016-senate-final",
"https://hslf.org/scorecards/2017-house-midterm",
"https://hslf.org/scorecards/2017-senate-midterm",
"https://hslf.org/scorecards/2018-house-final",
"https://hslf.org/scorecards/2018-senate-final"]
for url in URL_list:
    browser = webdriver.Chrome(ChromeDriverManager().install())
    time.sleep(5)
    print("Current session is {}".format(browser.session_id))
    browser.quit()
    try:
        browser.get(url)
    except exceptions.InvalidSessionIdException as e:
        print(e.message)
    html = browser.page_source
    humane_sc_tables = pd.read_html(html)
    humane_sc_data = humane_sc_tables[0]
    humane_sc_data = humane_sc_data.drop(humane_sc_data.columns[[0,5,7]], axis = 1)
    browser.close()
    humane_sc_data.to_csv(f'humane_scores{url}.csv')
However, I get the following error:
MaxRetryError: HTTPConnectionPool(host='127.0.0.1', port=55494): Max retries exceeded with url: /session/7e430735b2d015147dc20049f3b78b10/url (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f9c018aa210>: Failed to establish a new connection: [Errno 61] Connection refused'))
Please advise.
I got it working. See the code below:
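The problem is your call to browser.quit(). You are closing the browser instance before the .get() request that retrieves the content you need. quit() also shuts down the driver process, so the next command sent to it at 127.0.0.1 fails with "Connection refused", which is the MaxRetryError you see. Try moving that line to the end of the loop, so that each iteration creates a new session and closes it only after the page has been read.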
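The fix described above can be sketched roughly as follows. This is a minimal sketch, not the original poster's final code: the function names scrape_scorecards and csv_name_for and the wait_seconds parameter are made up for illustration. It also assumes the file name should be built from the last part of the URL, because the full URL contains slashes and cannot be used as a file name as-is.

```python
import time
from io import StringIO
from urllib.parse import urlparse


def csv_name_for(url):
    """Build a CSV file name from the last path segment of the URL (hypothetical helper)."""
    slug = urlparse(url).path.rstrip("/").split("/")[-1]
    return f"humane_scores_{slug}.csv"


def scrape_scorecards(urls, wait_seconds=10):
    """Fetch the first table on each page and save it to its own CSV file."""
    # Imported here so that csv_name_for can be used without a browser installed.
    import pandas as pd
    from selenium import webdriver
    from webdriver_manager.chrome import ChromeDriverManager

    for url in urls:
        # Same constructor call as in the question; newer Selenium releases
        # may require passing the driver path through a Service object instead.
        browser = webdriver.Chrome(ChromeDriverManager().install())
        try:
            browser.get(url)
            time.sleep(wait_seconds)  # give the page time to render the table
            tables = pd.read_html(StringIO(browser.page_source))
        finally:
            # Quit only after the page has been read, so the session is
            # still alive when get() and page_source are used.
            browser.quit()

        data = tables[0]
        # Drop the same columns as in the question.
        data = data.drop(data.columns[[0, 5, 7]], axis=1)
        data.to_csv(csv_name_for(url))
```

Calling scrape_scorecards(URL_list) with the list from the question would then write one file per page, for example humane_scores_2007-senate-midterm.csv. Putting quit() in a finally block is a design choice so that a failed page load does not leave a stray browser window open.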