<p>You can't parse it because the data is loaded dynamically. As the screenshot below shows, the HTML that ends up written into the page is not present in the downloaded HTML source; JavaScript later reads the <code>window.__SITE</code> variable and extracts the data from it:</p>
<p><a href="https://i.stack.imgur.com/SVIk2.png" rel="nofollow noreferrer"><img src="https://i.stack.imgur.com/SVIk2.png" alt="code screenshot"/></a></p>
<p>However, we can replicate this in Python. First, download the web page:</p>
<pre class="lang-py prettyprint-override"><code>import requests
url = "https://riwayat-file-covid-19-dki-jakarta-jakartagis.hub.arcgis.com/"
req = requests.get(url)
</code></pre>
<p>You can then extract the encoded page source with <code>re</code> (regular expressions):</p>
<pre class="lang-py prettyprint-override"><code>import re
encoded_data = re.search(r'window\.__SITE="(.*)"', req.text).group(1)
</code></pre>
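<p>Note that <code>re.search</code> returns <code>None</code> when the pattern is absent (for example, if the site ever changes its inline script), which would make the subsequent <code>.group</code> call fail with an <code>AttributeError</code>. A minimal guard, sketched here with a hypothetical sample string standing in for <code>req.text</code>:</p>

```python
import re

# Hypothetical stand-in for req.text: the page embeds URL-encoded JSON
# in a window.__SITE assignment inside an inline script.
page_text = 'var a=1;window.__SITE="%7B%22site%22%3A%7B%7D%7D";var b=2;'

match = re.search(r'window\.__SITE="(.*)"', page_text)
if match is None:
    raise ValueError("window.__SITE not found; the page layout may have changed")
encoded_data = match.group(1)
print(encoded_data)  # %7B%22site%22%3A%7B%7D%7D
```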
<p>After that, you can URL-decode the text with <code>urllib</code> and parse the JSON string with <code>json</code>:</p>
<pre class="lang-py prettyprint-override"><code>from urllib.parse import unquote
from json import loads
json_data = loads(unquote(encoded_data))
</code></pre>
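<p>To see what those two calls do, here is the same decode step applied to a short, hypothetical payload (the real <code>encoded_data</code> is much larger):</p>

```python
from urllib.parse import unquote
from json import loads

# Hypothetical URL-encoded JSON: %7B -> {, %22 -> ", %3A -> :, %7D -> }
encoded_data = "%7B%22site%22%3A%20%7B%22title%22%3A%20%22demo%22%7D%7D"

decoded = unquote(encoded_data)
print(decoded)                     # {"site": {"title": "demo"}}
json_data = loads(decoded)         # now a plain Python dict
print(json_data["site"]["title"])  # demo
```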
<p>Then you can walk the JSON tree to reach the embedded HTML source:</p>
<pre class="lang-py prettyprint-override"><code>html_src = json_data["site"]["data"]["values"]["layout"]["sections"][1]["rows"][0]["cards"][0]["component"]["settings"]["markdown"]
</code></pre>
<p>At this point you can parse the HTML with your own code:</p>
<pre class="lang-py prettyprint-override"><code>from bs4 import BeautifulSoup

soup = BeautifulSoup(html_src, 'html.parser')
print(soup.prettify())
links = soup.find_all('a')
for link in links:
    if "href" in link.attrs:
        print(link.attrs['href'] + "\n")
</code></pre>
<p>Putting it all together, here is the final script:</p>
<pre class="lang-py prettyprint-override"><code>import requests
import re
from urllib.parse import unquote
from json import loads
from bs4 import BeautifulSoup
# Download URL
url = "https://riwayat-file-covid-19-dki-jakarta-jakartagis.hub.arcgis.com/"
req = requests.get(url)
# Get encoded JSON from HTML source
encoded_data = re.search(r'window\.__SITE="(.*)"', req.text).group(1)
# Decode and load as dictionary
json_data = loads(unquote(encoded_data))
# Get the HTML source code for the links
html_src = json_data["site"]["data"]["values"]["layout"]["sections"][1]["rows"][0]["cards"][0]["component"]["settings"]["markdown"]
# Parse it using BeautifulSoup
soup = BeautifulSoup(html_src, 'html.parser')
print(soup.prettify())
# Get links
links = soup.find_all('a')
# For each link...
for link in links:
    if "href" in link.attrs:
        print(link.attrs['href'] + "\n")
</code></pre>