Getting an audio source link from a website with Python

Posted 2024-06-24 12:40:16


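I am writing a script to get the audio source link from a website. By scraping the main page I get a list of the available links. But when I scrape each of the resulting links, I cannot find the source. (It should be inside the href of an <audio> tag.)

This is my code: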

# -*- coding: utf-8 -*-
import urllib.request
from bs4 import BeautifulSoup

def getHTML(st):
    # Fetch the raw HTML of the URL passed in as `st`
    with urllib.request.urlopen(st, timeout=100) as response:
        return response.read()

site = 'http://www.e-radio.gr'
soup = BeautifulSoup(getHTML(site), 'html.parser')
# Parse Main Page And get links
lst = list()

for a in soup.body.find_all('a', {'class' : 'erplayer'}):
    item = a.get('href')
    if site in item:
        lst.append(item)
    else:
        lst.append(site + item)

print("\n".join(lst))

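The site does not seem to load completely, and the audio source is not loaded when I use urllib.request. What else can I use instead of urllib.request so that it waits for the whole page to load? I thought about using an external web browser to generate the HTML, but I don't know how to do that.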


1 answer

Answer #1 · posted 2024-06-24 12:40:16

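This is a bit tricky, but we can get there step by step. First, get the player's HTML through the iframe link. Then, get the Flash player link and follow it. Finally, extract the link to the MP3 and download the stream. All of this happens within the same web-scraping session: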

from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup


def download_file(session, link, path):
    # Stream the response so the body is written to disk chunk by chunk
    r = session.get(link, stream=True)
    if r.status_code == 200:
        with open(path, 'wb') as f:
            for chunk in r.iter_content(chunk_size=8192):
                f.write(chunk)


base_url = "http://www.e-radio.gr"
url = "http://www.e-radio.gr/Rainbow-89-Thessaloniki-i92/live"

with requests.Session() as session:
    # Add a browser-like User-Agent while keeping the session's other default headers
    session.headers.update({'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.96 Safari/537.36'})
    response = session.get(url)

    # Step 1: find the player iframe on the station page and resolve its URL
    soup = BeautifulSoup(response.content, "html.parser")
    frame = soup.find(id="playerControls1")
    frame_url = urljoin(base_url, frame["src"])

    # Step 2: load the player HTML and follow its Flash player link
    response = session.get(frame_url)
    soup = BeautifulSoup(response.content, "html.parser")
    link = soup.select_one(".onerror a")['href']
    flash_url = urljoin(response.url, link)

    # Step 3: take everything after "url=" in the flashvars parameter as the MP3 link
    response = session.get(flash_url)
    soup = BeautifulSoup(response.content, "html.parser")
    mp3_link = soup.select_one("param[name=flashvars]")['value'].split("url=", 1)[-1]
    print(mp3_link)

    download_file(session, mp3_link, "download.mp3")
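
The last extraction step takes everything after "url=" in the flashvars value, so any fields that follow the URL would end up in the link. As a rough sketch only, assuming the flashvars value is laid out like a query string (this is a guess; the real page may differ), the standard library's parse_qs can pick out just the url field. The function name stream_url_from_flashvars and the sample string are made up for illustration.

```python
from urllib.parse import parse_qs


def stream_url_from_flashvars(flashvars: str) -> str:
    """Return the value of the `url` field from a query-string style flashvars value."""
    fields = parse_qs(flashvars)
    if "url" not in fields:
        raise ValueError("no 'url' field in flashvars")
    # parse_qs maps each key to a list of values; take the first one
    return fields["url"][0]


# Hypothetical flashvars value, not taken from the real site
sample = "autoplay=true&url=http://example.com/stream.mp3&volume=80"
stream = stream_url_from_flashvars(sample)
print(stream)
```

On this sample, split("url=", 1)[-1] would also keep the trailing "&volume=80". The trade-off is that parse_qs cuts the value at the first unescaped "&", so it would truncate a stream URL that carries its own unescaped query parameters.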

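
If the extracted link turns out to be a continuous live stream rather than a finite file (likely for a radio station, though not confirmed here), a download loop that reads until the response ends may never finish. One possible workaround is to stop after a fixed number of bytes. The sketch below works on any iterable of byte chunks, so it runs without network access. The helper name save_capped and the fake chunks are illustrative and not part of the answer above.

```python
import io
from typing import BinaryIO, Iterable


def save_capped(chunks: Iterable[bytes], out: BinaryIO, max_bytes: int) -> int:
    """Write chunks to `out` until `max_bytes` bytes are written; return the byte count."""
    written = 0
    if max_bytes <= 0:
        return written
    for chunk in chunks:
        # Trim the final chunk so the cap is never exceeded
        piece = chunk[: max_bytes - written]
        out.write(piece)
        written += len(piece)
        if written >= max_bytes:
            break
    return written


# Stand-in for the chunks of a streamed HTTP response: three 4-byte chunks
fake_chunks = [b"aaaa", b"bbbb", b"cccc"]
buffer = io.BytesIO()
count = save_capped(fake_chunks, buffer, max_bytes=10)
print(count, buffer.getvalue())
```

With requests, the chunks argument could be r.iter_content(chunk_size=8192) from a response opened with stream=True, and out a file opened in 'wb' mode.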