<p>You have the most common problem: the browser uses <code>JavaScript</code> to add the links to the page (when you click a year), but <code>requests</code>/<code>BeautifulSoup</code> can't run <code>JavaScript</code>.</p>
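<p>Turn off <code>JavaScript</code> in your browser and check whether you can still reach the files without it. Then see how the page works in that mode and do the same in code. Sometimes this is not possible, and you need <a href="https://selenium-python.readthedocs.io/" rel="nofollow noreferrer">Selenium</a> to control a real browser that can run <code>JavaScript</code>.</p>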
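<p>The same check can be done from code: parse the HTML that the server sends and see whether it already contains links with the extension you need. This is only a minimal sketch. The helper name <code>links_with_extension</code> and both HTML snippets are made up for illustration and are not taken from the real page.</p>

```python
from bs4 import BeautifulSoup


def links_with_extension(html, extension):
    """Return the href of every <a> tag whose href ends with `extension`."""
    soup = BeautifulSoup(html, 'html.parser')
    return [
        a['href']
        for a in soup.find_all('a', href=True)
        if a['href'].lower().endswith(extension)
    ]


# Hypothetical HTML: a page that only links to a year folder (no files yet).
static_html = '<a href="0/fol/213974/Row1.aspx">2017</a>'

# Hypothetical HTML: a page that already lists a file.
subpage_html = (
    '<a href="docs/report.xls">report</a>'
    '<a href="0/fol/1/Row1.aspx">folder</a>'
)

print(links_with_extension(static_html, '.xls'))   # []
print(links_with_extension(subpage_html, '.xls'))  # ['docs/report.xls']
```

<p>If the list is empty for the HTML returned by <code>requests.get(url).text</code>, the links are probably added later by <code>JavaScript</code> or they live on another page.</p>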
<hr/>
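<p>When I open the URL in a browser with <code>JavaScript</code> turned off, I don't see any <code>.xls</code> files. I have to click a <code>year</code>, and then it loads a different URL that lists the <code>.xls</code> files.</p>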
<p>2017: <a href="https://lfportal.loudoun.gov/LFPortalinternet/0/fol/213974/Row1.aspx" rel="nofollow noreferrer">https://lfportal.loudoun.gov/LFPortalinternet/0/fol/213974/Row1.aspx</a><br/>
2018: <a href="https://lfportal.loudoun.gov/LFPortalinternet/0/fol/285051/Row1.aspx" rel="nofollow noreferrer">https://lfportal.loudoun.gov/LFPortalinternet/0/fol/285051/Row1.aspx</a><br/>
2019: <a href="https://lfportal.loudoun.gov/LFPortalinternet/0/fol/312510/Row1.aspx" rel="nofollow noreferrer">https://lfportal.loudoun.gov/LFPortalinternet/0/fol/312510/Row1.aspx</a><br/>
2020: <a href="https://lfportal.loudoun.gov/LFPortalinternet/0/fol/384496/Row1.aspx" rel="nofollow noreferrer">https://lfportal.loudoun.gov/LFPortalinternet/0/fol/384496/Row1.aspx</a><br/>
2021: <a href="https://lfportal.loudoun.gov/LFPortalinternet/0/fol/466963/Row1.aspx" rel="nofollow noreferrer">https://lfportal.loudoun.gov/LFPortalinternet/0/fol/466963/Row1.aspx</a></p>
<p>You have to use <code>BeautifulSoup</code> to find these URLs, load them with <code>requests</code>, and then search each loaded page for <code>.xls</code> links.</p>
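<p>Finding the year URLs could look roughly like the sketch below. It runs on a small hand-written HTML sample instead of the live page. The <code>href</code> values in the sample are an assumption about how the site writes its relative links, and <code>folder_urls</code> is a hypothetical helper name. Only the class <code>DocumentBrowserNameLink</code> and the base URL come from this answer.</p>

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

DOMAIN = 'https://lfportal.loudoun.gov/LFPortalinternet/'


def folder_urls(html, base=DOMAIN):
    """Return absolute URLs for all links with class DocumentBrowserNameLink."""
    soup = BeautifulSoup(html, 'html.parser')
    links = soup.find_all('a', {'class': 'DocumentBrowserNameLink'})
    # urljoin resolves a relative href against the base URL
    return [urljoin(base, a['href']) for a in links if a.get('href')]


# Hand-written sample (assumption): the real page has more markup around the links.
sample_html = '''
<a class="DocumentBrowserNameLink" href="0/fol/213974/Row1.aspx">2017</a>
<a class="DocumentBrowserNameLink" href="0/fol/285051/Row1.aspx">2018</a>
<a class="Other" href="help.aspx">Help</a>
'''

for url in folder_urls(sample_html):
    print(url)
```

<p>Each returned URL can then be loaded with <code>requests.get()</code> and searched for <code>.xls</code> links in the same way.</p>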
<hr/>
<p><strong>EDIT:</strong></p>
<p>This code finds the subpages and uses them to download the files.</p>
<p>It downloads each year into a separate folder.</p>
<pre><code>import requests
from bs4 import BeautifulSoup as bs
import os

# - functions -

def get_soup(url):
    """Download a page and return it as a parsed BeautifulSoup object."""
    response = requests.get(url)
    #print(response.status_code)
    #print(response.text)
    html = response.text

    soup = bs(html, 'html.parser')
    #soup = bs(html, 'lxml')
    #soup = bs(html, 'html5lib')

    return soup

# - main -

# - data -

DOMAIN = 'https://lfportal.loudoun.gov/LFPortalinternet/'
URL = 'https://lfportal.loudoun.gov/LFPortalinternet/Browse.aspx?startid=213973&row=1&dbid=0'
FILETYPE = '.xls'

# - code -

soup = get_soup(URL)

# the main page has one link per year folder
for folder_link in soup.find_all('a', {'class': 'DocumentBrowserNameLink'}):

    folder_name = folder_link.get('aria-label').split(' ')[0]
    folder_link = folder_link.get('href')

    print('folder:', folder_name)
    os.makedirs(folder_name, exist_ok=True)

    # the folder page lists the files for that year
    subsoup = get_soup(DOMAIN + folder_link)

    for file_link in subsoup.find_all('a', {'class': 'DocumentBrowserNameLink'}):

        file_name = file_link.get('aria-label')[:-4]  # skip extra `.xls` at the end
        file_link = file_link.get('href')

        if file_link.endswith(FILETYPE):

            print(' file:', file_name)

            # save the file in the folder for its year
            file_name = os.path.join(folder_name, file_name)
            with open(file_name, 'wb') as file:
                response = requests.get(DOMAIN + file_link)
                file.write(response.content)
</code></pre>
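<p>The code above opens the file before it sends the request, so a failed request can leave an empty file or a saved error page. A possible variation, shown only as a sketch, requests first and checks the status before writing. The helper <code>download_file</code>, its <code>get</code> parameter, and the <code>timeout</code> value are my additions for illustration, not part of the original code.</p>

```python
import os

import requests


def download_file(url, path, get=requests.get, timeout=30):
    """Download `url` to `path` and return the number of bytes written.

    `get` defaults to requests.get; it is a parameter only so the helper
    can be tried with a stand-in function instead of a real request.
    """
    response = get(url, timeout=timeout)
    response.raise_for_status()  # raise on 4xx/5xx instead of saving an error page

    # create the target folder if needed ('.' when path has no folder part)
    os.makedirs(os.path.dirname(path) or '.', exist_ok=True)

    with open(path, 'wb') as file:
        file.write(response.content)

    return len(response.content)
```

<p>In the loop it could replace the <code>with open(...)</code> block, for example <code>download_file(DOMAIN + file_link, os.path.join(folder_name, file_name))</code>.</p>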
<hr/>
<p><strong>BTW:</strong> I also put it on GitHub: <a href="https://github.com/furas/python-examples/tree/master/__scraping__/lfportal.loudoun.gov%20-%20requests%2C%20BS" rel="nofollow noreferrer">furas/python-examples</a></p>