How do I download .qrs files from a website using Python and BeautifulSoup?

import os
from bs4 import BeautifulSoup
# Python 3.x
from urllib.request import urlopen, urlretrieve

URL = 'https://physionet.org/physiobank/database/shareedb/'
OUTPUT_DIR = ''  # path to output folder, '.' or '' uses current folder

u = urlopen(URL)
try:
    html = u.read().decode('utf-8')
finally:
    u.close()

soup = BeautifulSoup(html, "html.parser")
for link in soup.select('a[href^="https://"]'):  # or a[href*="shareedb/0"]
    href = link.get('href')
    if not any(href.endswith(x) for x in ['.dat','.hea','.qrs']):
        continue

    filename = os.path.join(OUTPUT_DIR, href.rsplit('/', 1)[-1])

    # We need a https:// URL for this site
    # href = href.replace('http://','https://')

    print("Downloading %s to %s..." % (href, filename) )
    urlretrieve(href, filename)
    print("Done.")

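3 Answers

User

#1 · edited 2024-10-01 09:18:18

To expand on wolf tian's answer: select finds nothing because the links on that site do not contain "https://" in their href (nor "shareedb"). The files you are trying to download all have the structure <a href="01911.hea">01911.hea</a>, so their paths are relative. You therefore need to extract those filenames first, for example: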

for link in soup.select('a'):
    href = link.get('href')
    if not href or not any(href.endswith(x) for x in ['.dat','.hea','.qrs']):
        continue

    filename = os.path.join(OUTPUT_DIR, href)

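Then, before retrieving each file, you need to prepend the host part (the page URL) to the relative href, so that urlretrieve receives an absolute URL.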
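The two steps above (filter the relative hrefs, then make them absolute) could be sketched roughly as follows. This is only an illustration, not the answer's original code: the helper names is_wanted and absolute_url are hypothetical, and using urllib.parse.urljoin rather than plain string concatenation is an assumption made here.

```python
from urllib.parse import urljoin

URL = 'https://physionet.org/physiobank/database/shareedb/'
WANTED = ('.dat', '.hea', '.qrs')  # str.endswith accepts a tuple of suffixes


def is_wanted(href):
    """Return True for a non-empty href ending in one of the wanted extensions."""
    # Hypothetical helper: guards against <a> tags that have no href at all.
    return bool(href) and href.endswith(WANTED)


def absolute_url(href, base=URL):
    """Resolve a relative href such as '01911.hea' against the page URL."""
    # Hypothetical helper: urljoin also leaves already-absolute hrefs untouched.
    return urljoin(base, href)
```

Inside the loop this would be used as: if is_wanted(href): urlretrieve(absolute_url(href), filename).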

User

#2 · edited 2024-10-01 09:18:18

You can use the excellent requests library, as follows:

import bs4
import requests

url = "https://physionet.org/physiobank/database/shareedb/"
html = requests.get(url)
soup = bs4.BeautifulSoup(html.text, "html.parser")

for link in soup.find_all('a', href=True):
    href = link['href']

    # Only fetch the record files; every other link on the page is skipped
    if any(href.endswith(x) for x in ['.dat','.hea','.qrs']):
        print("Downloading '{}'".format(href))
        # hrefs on this page are relative, so prepend the page URL
        remote_file = requests.get(url + href)

        with open(href, 'wb') as f:
            for chunk in remote_file.iter_content(chunk_size=1024):
                if chunk:
                    f.write(chunk)

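This downloads all of the .dat, .hea and .qrs files to your computer.

requests is a third-party package; install it in the standard way with pip (pip install requests).

Note that all of the hrefs at that URL are already in a form suitable for direct use as filenames, so at the moment there is no need to parse out any / characters.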
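If the hrefs ever did contain path components or a query string, the filename could be derived from the URL path instead of being used verbatim. A small sketch of that idea, where filename_from_href is a hypothetical helper and not part of requests or BeautifulSoup:

```python
import posixpath
from urllib.parse import urlsplit, unquote


def filename_from_href(href):
    """Return the last path segment of a relative or absolute href."""
    # urlsplit(...).path drops the scheme, host, query string and fragment.
    path = urlsplit(href).path
    # Decode %xx escapes first, then keep only the part after the final '/'.
    return posixpath.basename(unquote(path))
```

In the loop above this would replace open(href, 'wb') with open(filename_from_href(href), 'wb'). An href ending in / yields an empty string, so such links should be skipped.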

User

#3 · edited 2024-10-01 09:18:18

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

start_url = 'https://physionet.org/physiobank/database/shareedb/'
r = requests.get(start_url)
soup = BeautifulSoup(r.text, 'lxml')

# get full url of file
pre = soup.find('pre')
file_urls = pre.select('a[href*="."]')
full_urls = [urljoin(start_url, url['href']) for url in file_urls]
# download file
for full_url in full_urls:
    file_name = full_url.split('/')[-1]
    print("Downloading {} to {}...".format(full_url, file_name))
    with open(file_name, 'wb') as f:
        fr = requests.get(full_url, stream=True)
        for chunk in fr.iter_content(chunk_size=1024):
            f.write(chunk)
    print('Done')

