Retrieving chart data via web scraping

Published 2024-09-29 22:37:38


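I'm new to web scraping and need some help with my query. On this page, https://ski-resort-stats.com/Hemsedal/, in the snow history section for Hemsedal, I am trying to retrieve the information shown on the chart (the snowfall for each year). I am starting with a single year (2013-2014).

I think I found the relevant part in the HTML code: (screenshot of the HTML code)

To do that, I tried: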

from bs4 import BeautifulSoup
import requests

url = "https://ski-resort-stats.com/Hemsedal/"
html_content = requests.get(url).text
soup = BeautifulSoup(html_content, "lxml")
gdp_table = soup.find("g", attrs={"class": "highcharts-markers highcharts-series-0 highcharts-spline-series highcharts-color-0 highcharts-tracker"})
# The next line is the one that raises the AttributeError described below
gdp_table_data = gdp_table.tbody.find_all("path")

But I get this error: "AttributeError: 'NoneType' object has no attribute 'tbody'". I tried using other elements from the HTML code, without success. Can anyone help me?


2 answers

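As @joni pointed out, the site runs JavaScript after the initial page load to populate the page with the chart data. That is why the requested HTML does not contain the chart elements and soup.find returns None. The code below uses selenium to load the page, grabs the 2013-2014 data point elements on the chart, and then hovers over each point so that the tooltip containing the actual data is displayed: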

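import time

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.action_chains import ActionChains

# Selenium 4 takes the driver path through a Service object;
# '/path/to/chromedriver' is a placeholder for your local driver.
d = webdriver.Chrome(service=Service('/path/to/chromedriver'))
d.get('https://ski-resort-stats.com/Hemsedal/')
results = []
# Select the data point markers by their fill color and skip the last match
for i in d.execute_script('''return document.querySelectorAll('g > path[fill="#7cb5ec"]')''')[:-1]:
    a = ActionChains(d)
    a.move_to_element(i).perform()  # hover so the tooltip is rendered
    time.sleep(0.3)  # give the tooltip time to appear
    # Read the first and fourth child of each matching tooltip text element
    results.append(d.execute_script('''
     function* get_hover_data(y_range){
         for (var i of document.querySelectorAll('text[x="8"][data-z-index="1"][y="18"]')){
             if (i.children.length === 5){
                 yield [i.children[0].textContent, i.children[3].textContent]
             }
         }

     }
     return [...get_hover_data('2013-2014')];
    '''))

# Keep the first tooltip of each non-empty hover result and drop the first entry
_, *final_results = [i[0] for i in results if i]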

Output:

[['45', '0'], ['46', '0'], ['47', '0'], ['48', '25'], ['49', '28'], ['50', '36'], ['51', '37'], ['52', '53,5'], ['1', '77,5'], ['2', '89,5'], ['3', '125,5'], ['4', '151,5'], ['5', '159,5'], ['6', '163,5'], ['7', '177,5'], ['8', '173,5'], ['9', '175'], ['10', '166'], ['11', '171'], ['12', '173,5'], ['13', '170'], ['14', '166'], ['15', '158,5']]
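
The tooltip values above are strings that use a decimal comma ('53,5'). If numbers are needed, a conversion along these lines might help. This is only a sketch: `parse_hover_rows` is a hypothetical helper name, and it assumes every row has the [week, value] shape shown in the output.

```python
def parse_hover_rows(rows):
    """Convert [week, value] string pairs into (int, float) tuples."""
    parsed = []
    for week, value in rows:
        # The tooltip text uses a decimal comma, so '53,5' means 53.5.
        parsed.append((int(week), float(value.replace(",", "."))))
    return parsed


# A few rows copied from the output above.
sample_rows = [['51', '37'], ['52', '53,5'], ['1', '77,5']]
snow_values = parse_hover_rows(sample_rows)
print(snow_values)  # [(51, 37.0), (52, 53.5), (1, 77.5)]
```

With the Selenium code above, the same call could be made as `parse_hover_rows(final_results)`.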

The data is embedded in the page as JavaScript. You can use the following example to parse it:

import re
import json
import requests

url = "https://ski-resort-stats.com/Hemsedal/"
html_doc = requests.get(url).text

data = re.search(r"wpDataCharts\[.*?\] = ({.*})", html_doc).group(1)
data = re.sub(r"([a-z_]+):", r'"\1":', data)
data = re.sub(r'"http":', "http:", data)
data = json.loads(data)

# uncomment this to print all data:
# print(json.dumps(data, indent=4))

for series in data["render_data"]["options"]["series"]:
    print(series["name"], series["data"])

print()
print("week =", data["render_data"]["options"]["xAxis"]["categories"])

Prints:

2013-2014 [0, 0, 0, 25, 28, 36, 36.5, 53.5, 77.5, 89.5, 125.5, 151.5, 159.5, 163.5, 177.5, 173.5, 175, 166, 171, 173.5, 169.5, 166, 158.5]
2014-2015 [0, 0, 0, 52, 70, 67, 74.5, 78, 74, 88, 98, 102, 109.5, 113, 113, 109.5, 110.5, 98, 95, 95, 99, 108, 102]
2015-2016 [0, 0, 0, 11.5, 25.5, 34, 37, 52, 64, 76, 82, 73, 79, 105, 118, 120, 136, 141, 116, 116, 100, 97, 95]
2016-2017 [0, 0, 0, 42, 33, 22, 17, 31, 25, 40, 47, 22, 15, 17, 15, 20, 47, 50, 63, 59, 57, 51, 50]
2017-2018 [10, 10, 40, 66, 64, 64, 48, 67, 77, 85, 120, 120, 140, 155, 175, 175, 175, 175, 175, 168, 170, 180, 180]
2012-2013 [0, 0, 0, 61.5, 60, 61, 76.5, 90.5, 95, 85, 85, 85, 87.5, 100, 102.5, 102.5, 100.5, 104, 101, 100, 99, 97.5, 95]
2018-2019 [0, 0, 0, 0, 23, 33, 48, 49, 50, 75, 68, 68, 115, 115, 115, 80, 80, 85, 85, 110, 110, 110, 110]
2019-2020 [45, 45, 40, 80, 80, 80, 97, 107, 107, 107, 107, 113, 113, 113, 113, 118, 118, 127, 127, 0, 0, 0, 0]
week = [45, 46, 47, 48, 49, 50, 51, 52, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]
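
To see what each regex step in the code above does without hitting the network, the same steps can be run on a small inline string. The sample below is hand-written and hypothetical: it only imitates the mix of unquoted and quoted keys that the regexes imply, and the real page content may well differ. The sketch also pairs each week number with its value, which is an addition and not part of the original answer.

```python
import json
import re

# Hypothetical one-line sample imitating the embedded script (not real page data).
sample_html = (
    '<script>wpDataCharts[1] = {render_data: {"options": {'
    '"xAxis": {"categories": [45, 46, 47, 48]}, '
    '"series": [{"name": "2013-2014", "data": [0, 0, 0, 25]}]}}, '
    'engine: "highcharts", url: "http://example.com/chart"};</script>'
)

# Step 1: capture the object literal assigned to wpDataCharts[...].
raw = re.search(r"wpDataCharts\[.*?\] = ({.*})", sample_html).group(1)

# Step 2: wrap bare lowercase keys in double quotes so the text becomes JSON.
quoted = re.sub(r"([a-z_]+):", r'"\1":', raw)

# Step 3: step 2 also turned http: inside URLs into "http":, so undo that.
fixed = re.sub(r'"http":', "http:", quoted)

chart = json.loads(fixed)
options = chart["render_data"]["options"]
weeks = options["xAxis"]["categories"]

# Map each season name to a {week: value} dictionary.
by_week = {s["name"]: dict(zip(weeks, s["data"])) for s in options["series"]}
print(by_week["2013-2014"])  # {45: 0, 46: 0, 47: 0, 48: 25}
```

Note that the key-quoting regex only matches lowercase letters and underscores, so it would mangle an unquoted mixed-case key such as xAxis; the sample therefore keeps such keys already quoted.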
