Posted 2024-10-08 18:22:06
User
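I am looking at this site: https://www.eafo.eu/alternative-fuels/electricity/charging-infra-stats
I cannot extract the dynamic chart values with bs4 or Selenium. I can get the HTML, but it contains no data values. With Selenium I can also capture the HTML, but again without the data. Am I missing something, or is there a more powerful tool for handling dynamic web pages?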
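Yes, this is an interesting problem, and one that trips up many people when scraping data. The issue is that the charts are loaded in JavaScript after document ready (the original answer linked to more information about document ready). Essentially, the charts are rendered only after all of the HTML, CSS and JS have loaded, and the data is bound to data attributes.
I created a code sample that uses a NodeJS Express server to return the data from all of the charts as JSON. Essentially, it hits the URL, targets the class the charts live in, and then looks for the data-* attribute that contains all of the chart data. That way you have working code to use and fork whenever you run into JavaScript-rendered charts like these.
GitHub repo with NodeJS and Python solutions: https://github.com/joehoeller/dynamic-chart-parser-for-webscraping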
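As a rough illustration of the data-* attribute approach described above, here is a minimal Python sketch that uses only the standard library. The class name `chart`, the attribute name `data-chart`, and the sample HTML are all assumptions made for illustration; the real page's markup may differ, and this is not the code from the linked repo.

```python
import json
from html.parser import HTMLParser


class ChartDataParser(HTMLParser):
    """Collect JSON payloads stored in a data-* attribute of chart elements."""

    def __init__(self, css_class="chart", data_attr="data-chart"):
        super().__init__()
        self.css_class = css_class  # hypothetical class name
        self.data_attr = data_attr  # hypothetical attribute name
        self.charts = []

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        classes = (attr_map.get("class") or "").split()
        raw = attr_map.get(self.data_attr)
        # Only keep elements that carry the chart class and a payload.
        if self.css_class in classes and raw:
            self.charts.append(json.loads(raw))


def extract_chart_data(html, css_class="chart", data_attr="data-chart"):
    """Return the decoded data-* payload of every matching element."""
    parser = ChartDataParser(css_class, data_attr)
    parser.feed(html)
    return parser.charts


# Made-up markup standing in for the downloaded page.
SAMPLE_HTML = """
<div class="chart wide" data-chart='{"rows": [[2019, 15136], [2020, 24987]]}'></div>
<div class="footer" data-chart='{"rows": []}'></div>
"""

charts = extract_chart_data(SAMPLE_HTML)
print(charts)
```

This only works when the data is already present in the HTML the server returns. If the attribute is filled in by JavaScript after load, the raw HTML will not contain it.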
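Each of the six charts on the page is populated with data from its own API call, which you can find in the Network tab of your browser's developer tools. You can send requests to these endpoints yourself and parse the responses: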
import json
import urllib.parse

import requests

# Headers copied from the browser's own request to the chart endpoints.
headers = {
    'authority': 'www.eafo.eu',
    'pragma': 'no-cache',
    'cache-control': 'no-cache',
    'sec-ch-ua': '"Chromium";v="92", " Not A;Brand";v="99", "Google Chrome";v="92"',
    'accept': 'application/json, text/javascript, */*; q=0.01',
    'x-requested-with': 'XMLHttpRequest',
    'sec-ch-ua-mobile': '?0',
    'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.131 Safari/537.36',
    'sec-fetch-site': 'same-origin',
    'sec-fetch-mode': 'cors',
    'sec-fetch-dest': 'empty',
    'referer': 'https://www.eafo.eu/alternative-fuels/electricity/charging-infra-stats',
    'accept-language': 'en-US,en;q=0.9',
    'cookie': 'yearFilter=2020; activeSubMenu=electricity; subMenuActiveItem=charging_infra_stats; fuelFilter=Electricity; _ga=GA1.2.1782486955.1628797896; _gid=GA1.2.47726291.1628797896; _gat_gtag_UA_129775638_1=1',
}

# One endpoint per chart. The query string already carries compare=false,
# so no separate params argument is passed (it would duplicate the parameter).
urls = [
    'https://www.eafo.eu/normal-and-fast-charge-points/-1/-1/-1/false/false/nvt?compare=false',
    'https://www.eafo.eu/charging-positions-per-10-evs/-1/-1/-1/false/false/nvt?compare=false',
    'https://www.eafo.eu/normal-power-charging-positions/-1/-1/-1/false/false/nvt?compare=false',
    'https://www.eafo.eu/fillingstations-electricity-top-5/-1/-1/-1/false/false/nvt?compare=false',
    'https://www.eafo.eu/fast-charging/-1/-1/-1/false/false/nvt?compare=false',
    'https://www.eafo.eu/top-5-countries-charging-positions-per-10-evs/-1/-1/-1/false/false/nvt?compare=false',
]

# Pair each chart name (the first path segment) with its parsed JSON response.
data = [
    [urllib.parse.urlparse(url).path.split('/')[1],
     requests.get(url, headers=headers).json()]
    for url in urls
]

# Each row keeps its cells under 'c'; take the value 'v' of the first two cells.
result = {
    name: [[row['c'][0]['v'], row['c'][1]['v']] for row in payload['data']['rows']]
    for name, payload in data
}
print(result)
Output:
{'normal-and-fast-charge-points': [[2008, 0], [2009, 0], [2010, 0], [2011, 13], [2012, 257], [2013, 751], [2014, 1474], [2015, 3396], [2016, 5190], [2017, 8723], [2018, 11138], [2019, 15136], [2020, 24987]], 'charging-positions-per-10-evs': [['2008', 0], ['2009', 0], ['2010', '14'], ['2011', '6'], ['2012', '3'], ['2013', '4'], ['2014', '5'], ['2015', '5'], ['2016', '5'], ['2017', '5'], ['2018', '6'], ['2019', '7'], ['2020', '9']], 'normal-power-charging-positions': [['2008', 0], ['2009', 0], ['2010', 400], ['2011', 2379], ['2012', 10250], ['2013', 17093], ['2014', 24917], ['2015', 44786], ['2016', 70012], ['2017', 97287], ['2018', 107446], ['2019', 148880], ['2020', 199250]], 'fillingstations-electricity-top-5': [['Netherlands', 66461], ['France', 45413], ['Germany', 43633], ['Sweden', 13564], ['Italy', 13214]], 'fast-charging': [['2008', 0], ['2009', 0], ['2010', 0], ['2011', 13], ['2012', 257], ['2013', 751], ['2014', 1474], ['2015', 3396], ['2016', 5190], ['2017', 8723], ['2018', 11138], ['2019', 15136], ['2020', 24987]], 'top-5-countries-charging-positions-per-10-evs': [['Latvia', '3.15'], ['Slovakia', '4.34'], ['Croatia', '5.14'], ['Estonia', '5.31'], ['Netherlands', '5.71']]}
In a cleaner JSON format:
t = {' '.join(map(str.capitalize, a.split('-'))): b for a, b in result.items()}
print(json.dumps(t, indent=4))
{ "Normal And Fast Charge Points": [ [ 2008, 0 ], [ 2009, 0 ], [ 2010, 0 ], [ 2011, 13 ], [ 2012, 257 ], [ 2013, 751 ], [ 2014, 1474 ], [ 2015, 3396 ], [ 2016, 5190 ], [ 2017, 8723 ], [ 2018, 11138 ], [ 2019, 15136 ], [ 2020, 24987 ] ], "Charging Positions Per 10 Evs": [ [ "2008", 0 ], [ "2009", 0 ], [ "2010", "14" ], [ "2011", "6" ], [ "2012", "3" ], [ "2013", "4" ], [ "2014", "5" ], [ "2015", "5" ], [ "2016", "5" ], [ "2017", "5" ], [ "2018", "6" ], [ "2019", "7" ], [ "2020", "9" ] ], "Normal Power Charging Positions": [ [ "2008", 0 ], [ "2009", 0 ], [ "2010", 400 ], [ "2011", 2379 ], [ "2012", 10250 ], [ "2013", 17093 ], [ "2014", 24917 ], [ "2015", 44786 ], [ "2016", 70012 ], [ "2017", 97287 ], [ "2018", 107446 ], [ "2019", 148880 ], [ "2020", 199250 ] ], "Fillingstations Electricity Top 5": [ [ "Netherlands", 66461 ], [ "France", 45413 ], [ "Germany", 43633 ], [ "Sweden", 13564 ], [ "Italy", 13214 ] ], "Fast Charging": [ [ "2008", 0 ], [ "2009", 0 ], [ "2010", 0 ], [ "2011", 13 ], [ "2012", 257 ], [ "2013", 751 ], [ "2014", 1474 ], [ "2015", 3396 ], [ "2016", 5190 ], [ "2017", 8723 ], [ "2018", 11138 ], [ "2019", 15136 ], [ "2020", 24987 ] ], "Top 5 Countries Charging Positions Per 10 Evs": [ [ "Latvia", "3.15" ], [ "Slovakia", "4.34" ], [ "Croatia", "5.14" ], [ "Estonia", "5.31" ], [ "Netherlands", "5.71" ] ] }
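The output above mixes integers with numeric strings (for example '14' and '3.15'). If numbers are needed downstream, a small normalizing step could look like the sketch below. `to_number` and `normalize` are hypothetical helper names, and `sample` is a trimmed copy of the output shown above rather than a fresh API response.

```python
def to_number(value):
    """Return an int or float for numeric input; leave other values unchanged."""
    if isinstance(value, (int, float)):
        return value
    try:
        return int(value)
    except (TypeError, ValueError):
        pass
    try:
        return float(value)
    except (TypeError, ValueError):
        return value  # non-numeric label, e.g. a country name


def normalize(result):
    """Convert every [label, value] pair of every chart to numbers where possible."""
    return {
        name: [[to_number(label), to_number(val)] for label, val in rows]
        for name, rows in result.items()
    }


# Trimmed copy of the scraped output, kept small for illustration.
sample = {
    'charging-positions-per-10-evs': [['2008', 0], ['2010', '14']],
    'top-5-countries-charging-positions-per-10-evs': [['Latvia', '3.15']],
}
normalized = normalize(sample)
print(normalized)
```

Trying `int` before `float` keeps years such as '2008' as integers while still handling ratios such as '3.15'.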