Web scraping with BeautifulSoup (Jupyter notebook)

import requests
import urllib.request
import time
from bs4 import BeautifulSoup

url = 'https://data.toerismevlaanderen.be/'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
soup.findAll('a')
one_a_tag = soup.findAll('a')[35]
link = one_a_tag['href']
download_url = 'https://data.toerismevlaanderen.be/'+ link
urllib.request.urlretrieve(download_url,'./'+link[link.find('/tourist/reca/beer_bars_')+1:])
time.sleep

3 answers

User

#1 · edited 2024-09-28 23:47:03

The site has an API, so I would use that instead of scraping the HTML.

For example:

import requests

r = requests.get('https://opendata.visitflanders.org/tourist/reca/beer_bars.json?page=1&page_size=500&limit=1').json()
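If you need more than one page, the query string can be built from parameters instead of being hard-coded. Here is a minimal sketch using only the standard library. The helper names `build_url` and `fetch_page` are made up for illustration, and the `page` and `page_size` parameters are simply copied from the URL above. I left out `limit` because I am not sure what it does, and I have not checked the shape of the JSON the API returns.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Endpoint taken from the example URL above.
API_URL = "https://opendata.visitflanders.org/tourist/reca/beer_bars.json"


def build_url(page=1, page_size=500):
    """Return the API URL for one page of results (hypothetical helper)."""
    query = urlencode({"page": page, "page_size": page_size})
    return f"{API_URL}?{query}"


def fetch_page(page=1, page_size=500, timeout=10):
    """Download one page and decode it as JSON (hypothetical helper).

    Assumes the endpoint answers with a JSON body, as the example above implies.
    """
    with urlopen(build_url(page, page_size), timeout=timeout) as response:
        return json.load(response)
```

Calling `fetch_page(1)` should give roughly the same data as the `requests` one-liner above, assuming the endpoint still responds the same way without `limit`.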

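User

#2 · edited 2024-09-28 23:47:03

Many of the links you get back are absolute. Prepending the original URL to them before making the new request will not work. Just request the "link" you scraped as it is.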
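If the page mixes absolute and relative links, `urllib.parse.urljoin` handles both cases, so you do not have to check which kind you got. A small sketch follows. `resolve_link` is just an illustrative name, and the relative path in the comment is a made-up example, not a link I verified on the site.

```python
from urllib.parse import urljoin

BASE_URL = "https://data.toerismevlaanderen.be/"


def resolve_link(href, base=BASE_URL):
    """Turn a scraped href into a full URL (hypothetical helper).

    Absolute hrefs come back unchanged; relative ones are resolved
    against the base URL, e.g. 'tourist/reca/beer_bars' (made-up path).
    """
    return urljoin(base, href)
```

With this, `download_url = resolve_link(link)` avoids the doubled host name regardless of how the `href` was written.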

User

#3 · edited 2024-09-28 23:47:03

The problem is here:

link = one_a_tag['href']
print(link)

This returns a link that is already absolute: https://data.toerismevlaanderen.be/

You then append this link to the base URL when building download_url:

download_url = 'https://data.toerismevlaanderen.be/'+ link

So if you print(download_url), you get:

https://data.toerismevlaanderen.be/https://data.toerismevlaanderen.be/

That is not a valid URL.

Update based on the comments

The problem is that tourist/activities/breweries does not appear anywhere in the page you scraped. If you write:

for link in soup.findAll('a'):
  print(link.get('href'))

you can see the href of every a tag. None of them contains tourist/activities/breweries.
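Instead of reading through the printed list, you can check this programmatically. A small sketch: `hrefs_containing` is an illustrative name, and it works on a plain list of strings so it does not depend on the page. Note that `link.get('href')` returns None for an a tag without an href attribute, so those entries are skipped.

```python
def hrefs_containing(hrefs, fragment):
    """Return the hrefs that contain the given fragment (hypothetical helper).

    None entries (a tags without an href attribute) are ignored.
    """
    return [href for href in hrefs if href and fragment in href]


# Typical use with the soup object from the question:
#   hrefs = [a.get('href') for a in soup.findAll('a')]
#   matches = hrefs_containing(hrefs, 'tourist/activities/breweries')
# An empty list means the page has no such link.
```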

But if all you need is the link data.toerismevlaanderen.be/tourist/activities/breweries, you can do the following:

download_url = link + "tourist/activities/breweries"
