Why can't I extract the URLs from the hyperlinks on this website?

Published 2024-09-30 10:28:15


I'm trying to extract the URLs from the hyperlinks on this website: https://riwayat-file-covid-19-dki-jakarta-jakartagis.hub.arcgis.com/

I used the following Python code:

import requests
from bs4 import BeautifulSoup

url = "https://riwayat-file-covid-19-dki-jakarta-jakartagis.hub.arcgis.com/"
req = requests.get(url)
soup = BeautifulSoup(req.content, 'html.parser')
print(soup.prettify())

links = soup.find_all('a')

for link in links:
    if "href" in link.attrs:
        print(str(link.attrs['href'])+"\n")

The problem is that this code doesn't return any URLs.

I want to get all of the link URLs shown on that page.


2 Answers

The links are generated dynamically by JavaScript code. The data can be found in the following structure:

<script id="site-injection">
      window.__SITE="your data is here"
</script>

So you need to get this script element and parse the value of window.__SITE.
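A minimal sketch of that step, using only the standard library. The `sample_html` string and its short encoded payload are made up for illustration (the real value on the site is far longer), and `ScriptGrabber` is a hypothetical helper name. With BeautifulSoup, `soup.find("script", id="site-injection")` should locate the same element.

```python
import re
from html.parser import HTMLParser


class ScriptGrabber(HTMLParser):
    """Collects the text inside the <script> element that has a given id."""

    def __init__(self, script_id):
        super().__init__()
        self.script_id = script_id
        self._inside = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs
        if tag == "script" and dict(attrs).get("id") == self.script_id:
            self._inside = True

    def handle_endtag(self, tag):
        if tag == "script":
            self._inside = False

    def handle_data(self, data):
        if self._inside:
            self.chunks.append(data)


# Made-up stand-in for the downloaded page (assumption, not real site data)
sample_html = '''<html><body>
<script id="site-injection">
      window.__SITE="%7B%22title%22%3A%22demo%22%7D"
</script>
</body></html>'''

grabber = ScriptGrabber("site-injection")
grabber.feed(sample_html)
script_text = "".join(grabber.chunks)

# Pull out whatever sits between the quotes after window.__SITE=
match = re.search(r'window\.__SITE="(.*)"', script_text)
site_value = match.group(1) if match else None
print(site_value)
```

The extracted value is still URL-encoded at this point, so it needs decoding before it can be read as JSON.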

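You can't parse it because the data is loaded dynamically. When you download the HTML source, the HTML that eventually appears on the page isn't actually in it. JavaScript later parses the window.__SITE variable and extracts the data from it.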

However, we can replicate this in Python. After downloading the page:

import requests

url = "https://riwayat-file-covid-19-dki-jakarta-jakartagis.hub.arcgis.com/"
req = requests.get(url)

you can use re (regular expressions) to extract the encoded page source:

import re

encoded_data = re.search(r'window\.__SITE="(.*)"', req.text).group(1)

After that, you can URL-decode the text with urllib and parse the resulting JSON string with json:

from urllib.parse import unquote
from json import loads

json_data = loads(unquote(encoded_data))

Then you can walk the JSON tree to get the HTML source:

html_src = json_data["site"]["data"]["values"]["layout"]["sections"][1]["rows"][0]["cards"][0]["component"]["settings"]["markdown"]
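The hard-coded path above depends on the page layout staying exactly as it is. As a hedged alternative, the sketch below searches a nested structure for every "markdown" key instead of indexing a fixed position. `find_values` is a hypothetical helper, and `sample_tree` is a made-up stand-in that only mimics the shape of the path above, not the real site data.

```python
def find_values(node, key):
    """Yield every value stored under `key` anywhere in nested dicts and lists."""
    if isinstance(node, dict):
        for k, v in node.items():
            if k == key:
                yield v
            # Keep descending, since matches can be nested at any depth
            yield from find_values(v, key)
    elif isinstance(node, list):
        for item in node:
            yield from find_values(item, key)


# Made-up data shaped like the path used above (assumption for illustration)
sample_tree = {
    "site": {"data": {"values": {"layout": {"sections": [
        {"rows": [{"cards": [
            {"component": {"settings": {"markdown": "<p>intro</p>"}}}
        ]}]},
        {"rows": [{"cards": [
            {"component": {"settings": {
                "markdown": '<a href="https://example.com/a.xlsx">A</a>'
            }}}
        ]}]},
    ]}}}}
}

markdown_blocks = [
    value for value in find_values(sample_tree, "markdown")
    if isinstance(value, str)
]
for block in markdown_blocks:
    print(block)
```

Each string collected this way can then be parsed for anchor tags, at the cost of possibly picking up markdown cards that contain no links.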

At this point, you can parse the HTML with your own code:

from bs4 import BeautifulSoup

soup = BeautifulSoup(html_src, 'html.parser')
print(soup.prettify())

links = soup.find_all('a')

for link in links:
    if "href" in link.attrs:
        print(str(link.attrs['href'])+"\n")

Putting it all together, here is the final script:

import requests
import re
from urllib.parse import unquote
from json import loads
from bs4 import BeautifulSoup

# Download URL
url = "https://riwayat-file-covid-19-dki-jakarta-jakartagis.hub.arcgis.com/"
req = requests.get(url)

# Get encoded JSON from HTML source
encoded_data = re.search(r'window\.__SITE="(.*)"', req.text).group(1)

# Decode and load as dictionary
json_data = loads(unquote(encoded_data))

# Get the HTML source code for the links
html_src = json_data["site"]["data"]["values"]["layout"]["sections"][1]["rows"][0]["cards"][0]["component"]["settings"]["markdown"]

# Parse it using BeautifulSoup
soup = BeautifulSoup(html_src, 'html.parser')
print(soup.prettify())

# Get links
links = soup.find_all('a')

# For each link...
for link in links:
    if "href" in link.attrs:
        print(str(link.attrs['href'])+"\n")
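One fragile spot in the script is that `re.search` returns `None` when the pattern is absent, so calling a method on the result raises `AttributeError` if the site ever changes how it embeds the data. The sketch below wraps the extract-and-decode steps in a function that reports a missing match instead. `extract_site_json` is a hypothetical name, and the input strings are made up for illustration.

```python
import re
from json import loads
from urllib.parse import unquote


def extract_site_json(page_text):
    """Return the decoded window.__SITE payload as a dict, or None if absent."""
    match = re.search(r'window\.__SITE="(.*)"', page_text)
    if match is None:
        return None
    # URL-decode first, then parse the resulting JSON text
    return loads(unquote(match.group(1)))


# Made-up inputs (assumptions): one with an encoded payload, one without
found = extract_site_json('window.__SITE="%7B%22a%22%3A1%7D"')
missing = extract_site_json("<html><body>no data here</body></html>")

print(found)
print(missing)
```

In the full script, `req.text` would be passed in, and a `None` result would signal that the page structure needs to be inspected again.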
