I'm trying to use BeautifulSoup to scrape the .xls tables that can be downloaded from Xcel Energy's website (https://www.xcelenergy.com/working_with_us/municipalities/community_energy_reports).
This function grabs the URL links to the tables and tries to download them:
url = 'https://www.xcelenergy.com/working_with_us/municipalities/community_energy_reports'
dir = 'C:/Users/aobrien/PycharmProjects/xceldatascraper/'
def scraper(page):
    from bs4 import BeautifulSoup as bs
    import urllib.request
    import requests
    import os
    import re
    tld = r'https://www.xcelenergy.com'
    pageobj = requests.get(page, verify=False)
    sp = bs(pageobj.content, 'html.parser')
    xlst, fnms = [], []
    links = [a['href'] for a in sp.find_all('a', attrs={'href': re.compile("/staticfiles/")})]
    for idx, a in enumerate(links):
        if a.endswith('.xls'):
            furl = tld + str(a)
            xlst.append(furl)
            fnms.append(a.split('/')[4])
    naur = zip(fnms, xlst)
    if not os.path.exists(dir + 'tables'):
        os.makedirs(dir + 'tables')
    for name, url in naur:
        print(url)
        res = urllib.request.urlopen(url)
        xls = open(dir + 'tables/' + name, 'wb')
        xls.write(res.read())
        xls.close()
scraper(url)
The script fails when urllib.request.urlopen(url) tries to fetch a file, raising "urllib.error.HTTPError: HTTP Error 404: Not Found". The print(url) statement shows the URL the script constructs (https://www.xcelenergy.com/staticfiles/xe-responsive/WorkingWith Us/MI-City-Forest-Lake-2016.xls), and pasting that URL into a browser manually downloads the file just fine.
What am I missing?
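One detail worth noting: the printed URL contains a literal space in its path, which a browser silently percent-encodes before sending the request, while urllib.request.urlopen sends it as-is. A minimal sketch of encoding the path before fetching (the helper name and example URL below are illustrative, not from the script above):

```python
from urllib.parse import quote, urlsplit, urlunsplit

def encode_url_path(url):
    # Percent-encode unsafe characters (e.g. spaces) in the path component,
    # which browsers do automatically but urllib.request does not.
    parts = urlsplit(url)
    return urlunsplit(parts._replace(path=quote(parts.path)))

# Hypothetical example with a space in the path:
print(encode_url_path('https://example.com/Working With Us/file.xls'))
# -> https://example.com/Working%20With%20Us/file.xls
```

If this is the cause, wrapping each constructed furl with such a helper before calling urlopen should avoid the 404.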