Answers to: Scraping and downloading Excel files from a URL with Python



1 answer

Anonymous, 1 day ago

You have the most common problem: the browser uses <code>JavaScript</code> to add links to the page (when you click a year), but <code>requests</code>/<code>BeautifulSoup</code> can't run <code>JavaScript</code>.
You have to turn off <code>JavaScript</code> in the browser and check whether you can still get the files without it. Then you have to see how the page works and do the same in code. Sometimes it may need <a href="https://selenium-python.readthedocs.io/" rel="nofollow noreferrer">Selenium</a> to control a real browser that can run <code>JavaScript</code>.
<hr/>
When I open the URL in a browser without <code>JavaScript</code>, I don't see any <code>.xls</code> files. I have to click a <code>year</code>, and then it loads a different URL that lists the <code>.xls</code> files.
2017: <a href="https://lfportal.loudoun.gov/LFPortalinternet/0/fol/213974/Row1.aspx" rel="nofollow noreferrer">https://lfportal.loudoun.gov/LFPortalinternet/0/fol/213974/Row1.aspx</a>
2018: <a href="https://lfportal.loudoun.gov/LFPortalinternet/0/fol/285051/Row1.aspx" rel="nofollow noreferrer">https://lfportal.loudoun.gov/LFPortalinternet/0/fol/285051/Row1.aspx</a>
2019: <a href="https://lfportal.loudoun.gov/LFPortalinternet/0/fol/312510/Row1.aspx" rel="nofollow noreferrer">https://lfportal.loudoun.gov/LFPortalinternet/0/fol/312510/Row1.aspx</a>
2020: <a href="https://lfportal.loudoun.gov/LFPortalinternet/0/fol/384496/Row1.aspx" rel="nofollow noreferrer">https://lfportal.loudoun.gov/LFPortalinternet/0/fol/384496/Row1.aspx</a>
2021: <a href="https://lfportal.loudoun.gov/LFPortalinternet/0/fol/466963/Row1.aspx" rel="nofollow noreferrer">https://lfportal.loudoun.gov/LFPortalinternet/0/fol/466963/Row1.aspx</a>
You have to use <code>BeautifulSoup</code> to find these URLs, load them with <code>requests</code>, and then search each loaded page for <code>.xls</code> links.
<hr/>
EDIT:
This code finds the subpages and uses them to download the files. Each year is downloaded into a separate folder.
<pre><code>import os

import requests
from bs4 import BeautifulSoup as bs

# - functions -

def get_soup(url):
    """Download a page and parse its HTML."""
    response = requests.get(url)
    #print(response.status_code)
    #print(response.text)
    html = response.text

    soup = bs(html, 'html.parser')
    #soup = bs(html, 'lxml')
    #soup = bs(html, 'html5lib')

    return soup

# - main -

# - data -

DOMAIN = 'https://lfportal.loudoun.gov/LFPortalinternet/'
URL = 'https://lfportal.loudoun.gov/LFPortalinternet/Browse.aspx?startid=213973&row=1&dbid=0'
FILETYPE = '.xls'

# - code -

soup = get_soup(URL)

# every year is a folder link on the main page
for folder_link in soup.find_all('a', {'class': 'DocumentBrowserNameLink'}):
    folder_name = folder_link.get('aria-label').split(' ')[0]
    folder_link = folder_link.get('href')

    print('folder:', folder_name)
    os.makedirs(folder_name, exist_ok=True)

    # the subpage for one year lists the files
    subsoup = get_soup(DOMAIN + folder_link)

    for file_link in subsoup.find_all('a', {'class': 'DocumentBrowserNameLink'}):
        file_name = file_link.get('aria-label')[:-4]  # skip extra `.xls` at the end
        file_link = file_link.get('href')

        if file_link.endswith(FILETYPE):
            print('  file:', file_name)
            file_name = os.path.join(folder_name, file_name)
            with open(file_name, 'wb') as file:
                response = requests.get(DOMAIN + file_link)
                file.write(response.content)
</code></pre>
<hr/>
BTW: I put it on GitHub: <a href="https://github.com/furas/python-examples/tree/master/__scraping__/lfportal.loudoun.gov%20-%20requests%2C%20BS" rel="nofollow noreferrer">furas/python-examples</a>
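The "check whether the links exist without JavaScript" step can also be tried offline before writing the full scraper. Below is a minimal sketch of that idea, not the answer's own method: it uses the standard library's html.parser instead of BeautifulSoup so it runs with no extra packages, and the names LinkCollector, find_file_links and SAMPLE_HTML, as well as the example.com URL, are made up for illustration. In real use the HTML string would be response.text from requests.get(url).

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag found in the fed HTML."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            href = dict(attrs).get('href')
            if href:
                self.hrefs.append(href)


def find_file_links(html, base_url, filetype='.xls'):
    """Return absolute URLs of links in `html` that end with `filetype`."""
    collector = LinkCollector()
    collector.feed(html)
    return [
        urljoin(base_url, href)
        for href in collector.hrefs
        if href.lower().endswith(filetype)
    ]


# Hypothetical static HTML (an assumption, not the real portal markup).
SAMPLE_HTML = '''
<a class="DocumentBrowserNameLink" href="0/doc/1/report.xls">report.xls</a>
<a class="DocumentBrowserNameLink" href="0/fol/2/Row1.aspx">2018</a>
'''

links = find_file_links(SAMPLE_HTML, 'https://example.com/portal/')
print(links)
```

If the returned list is empty for the HTML that requests receives, the file links are probably added by JavaScript or live on a subpage, which is the situation described above.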
