Accessing deeply nested data in an HTML <script> tag with Python

Posted 2024-10-03 04:27:21


I am trying to get specific data from a site where it sits inside a deeply nested <script> tag.

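I tried import json, hoping it would make things a bit easier, but that led to the well-known Expecting value: line 1 column 1 (char 0) error. I then tried a different approach (approach 1), also without success.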
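For context, this error usually means that the text passed to json.loads does not start with a JSON value. Here is a minimal sketch, assuming a shortened, made-up snippet in place of the real script contents. The script body begins with JavaScript ($(document)...), so the parser gives up at the very first character.

```python
import json

# Hypothetical, shortened stand-in for the text of the <script> tag.
script_text = '$(document).ready(function () {createJsonChart({"series":[]});});'

try:
    json.loads(script_text)
    error_message = None
except json.JSONDecodeError as exc:
    # The text starts with "$", which cannot begin any JSON value.
    error_message = str(exc)

print(error_message)
```

The embedded function(){...} literals would also be rejected, so the payload probably needs to be cut out and cleaned before json can read it.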

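Essentially, the relatively simple steps of connecting to the site and capturing the specific <script> tag work fine. The problem is getting the data I need out of it.

Assume the following element: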

script_tag = '''
<script id="startup" type="text/javascript">
$(document).ready(function () {createJsonChart({
"series":[{"name":"BNames","color":"#0043de","legendIndex":0,
"stack":null,
"data":[{"name":"BNames","color":"#0043de","y":0.0,
"legendIndex":0,
"events":{"click":function(){return false;}},
"subtotal":0.0,"displayValue":"0","tooltip":""},
{"name":"BNames","color":"#0043de","y":114.6,
"legendIndex":0,
"events":{"click":function(){return false;}},
"subtotal":0.0,"displayValue":"0",
"tooltip":"BNames: 114,60 % <br/> Month: oktober 2018"},
{"name":"BNames","color":"#0043de","y":108.5,
"legendIndex":0,
"events":{"click":function(){return false;}},
"subtotal":0.0,"displayValue":"0",
"tooltip":"BNames: 108,50 % <br/> Month: september 2019"},
{"name":"BNames","color":"#0043de","y":0.0,
"legendIndex":0,
"events":{"click":function(){return false;}},
"subtotal":0.0,"displayValue":"0","tooltip":""}]},
{"type":"line","marker":{"enabled":false,
"linecolor":null,"lineWidth":0,
"fillColor":null,"symbol":null,"radius":4},
"dashStyle":"Solid","lineWidth":2,
"step":"center","zIndex":"2","name":"Mandatory","color":"#f20808",
"legendIndex":0,"stack":1,
"data":[{"name":"Mandatory","color":"#f20808","y":104.1,
"legendIndex":0,
"events":{"click":function(){return false;}},"subtotal":0.0,"displayValue":"0",
"tooltip":"Mandatory: 104,10 %: 104,10 %"},
{"name":"Mandatory","color":"#f20808","y":104.1,
"legendIndex":0,
"events":{"click":function(){return false;}},
"subtotal":0.0,"displayValue":"0",
"tooltip":"Mandatory: 104,10 %"},
{"name":"Mandatory","color":"#f20808","y":104.1,
"legendIndex":0,
"events":{"click":function(){return false;}},
"subtotal":0.0,"displayValue":"0",
"tooltip":"Mandatory: 104,10 %"}]},
{"type":"line","marker":{"enabled":false,
"linecolor":null,"lineWidth":0,"fillColor":null,
"symbol":null,"radius":4},"dashStyle":"Solid","lineWidth":2,
"step":"center", "zIndex":"2","name":"Preferred","color":"#38d615",
"legendIndex":0,"stack":2,
"data":[{"name":"Preferred","color":"#38d615","y":121.0,
"legendIndex":0,
"events":{"click":function(){return false;}},"subtotal":0.0,"displayValue":"0",
"tooltip":"Preferred: 121,00 %: 121,00 %"},
{"name":"Preferred","color":"#38d615","y":121.0,
"legendIndex":0,
"events":{"click":function(){return false;}},"subtotal":0.0,"displayValue":"0",
"tooltip":"Preferred: 121,00 %"},
{"name":"Preferred","color":"#38d615","y":121.0,
"legendIndex":0,
"events":{"click":function(){return false;}},"subtotal":0.0,"displayValue":"0",
"tooltip":"Preferred: 121,00 %"}]}],
"resizeElement":null,"credits":{"enabled":false}});$('#__Page').lumnaInit('');});
</script>
'''

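The actual <script> tag is larger than this. It contains three data series, named BNames, Mandatory, and Preferred. I need the data from BNames, specifically its last populated entry. The expected result therefore comes from the part containing "tooltip":"BNames: 108,50 % <br/> Month: september 2019"}, with the BNames value in one variable and Month: september 2019 in another.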
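One possible way to get that result with json after all (an assumption, not something from the original post) is to cut the argument out of the createJsonChart(...) call, remove the JavaScript function literals, and parse what is left. The sample text below is a shortened, made-up version of the element above. The sketch assumes the call always ends with ");$(" and that every events entry has exactly the form shown.

```python
import json

# Shortened, made-up sample in the same shape as the script contents above.
script_text = '''
$(document).ready(function () {createJsonChart({
"series":[{"name":"BNames","color":"#0043de",
"data":[{"name":"BNames","y":114.6,
"events":{"click":function(){return false;}},
"tooltip":"BNames: 114,60 % <br/> Month: oktober 2018"},
{"name":"BNames","y":108.5,
"events":{"click":function(){return false;}},
"tooltip":"BNames: 108,50 % <br/> Month: september 2019"},
{"name":"BNames","y":0.0,
"events":{"click":function(){return false;}},
"tooltip":""}]},
{"name":"Mandatory","color":"#f20808",
"data":[{"name":"Mandatory","y":104.1,
"events":{"click":function(){return false;}},
"tooltip":"Mandatory: 104,10 %"}]}],
"resizeElement":null,"credits":{"enabled":false}});$('#__Page').lumnaInit('');});
'''

# 1. Cut out the argument of createJsonChart(...).
#    Assumption: the call is closed by the first ");$(" that follows it.
start = script_text.index("createJsonChart(") + len("createJsonChart(")
end = script_text.index(");$(", start)
payload = script_text[start:end]

# 2. Drop the JavaScript function literals, which are not valid JSON.
payload = payload.replace('"events":{"click":function(){return false;}},', "")

# 3. Parse, then pick the last BNames entry whose tooltip is not empty.
chart = json.loads(payload)
bnames = next(s for s in chart["series"] if s["name"] == "BNames")
tooltip = [d["tooltip"] for d in bnames["data"] if d["tooltip"]][-1]
value, month = tooltip.split(" <br/> ")

print(value)
print(month)
```

Parsing keeps the series structure intact, so the Mandatory and Preferred entries can be read the same way, at the cost of breaking if the site changes how it writes the events entries.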

Answer using regex

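import re  # soup is assumed to be a BeautifulSoup object built from the page
url_part = soup.find("script", attrs={'id': 'startup'}).text
info = re.findall(r'\s\w*\s\d*', url_part)[-1]  # last "month year" match
result = re.findall(r'(BNames: (\d+[,]\d+\s[%]))', url_part)[-1][1]  # last BNames percentage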

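First, select the HTML tag to work on. Second, find every instance of whitespace (\s), any number of word characters (\w*), whitespace (\s), and any number of digits (\d*). This matches things like september 2019 or august 2019. Finally, find the instances that start with BNames: and are followed by digits, a comma, digits, a space, and a percent sign. So (\d+[,]\d+\s[%]) matches everything from 80,6 % to 120,05 %.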
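The two patterns described above can be tried without the site. This is a small sketch that uses a made-up string in place of the real script text, so the sample values are assumptions.

```python
import re

# Made-up stand-in for the script text, holding two BNames tooltips.
url_part = (
    '"tooltip":"BNames: 114,60 % <br/> Month: oktober 2018"},'
    '"tooltip":"BNames: 108,50 % <br/> Month: september 2019"},'
)

# whitespace, word characters, whitespace, digits; keep the last match
info = re.findall(r'\s\w*\s\d*', url_part)[-1]

# With two groups, findall returns tuples: (outer group, inner group).
matches = re.findall(r'(BNames: (\d+[,]\d+\s[%]))', url_part)
result = matches[-1][1]

print(repr(info))
print(result)
```

Note that info keeps its leading space, so info.strip() may be wanted before using it.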


1 answer
User
#1 · Posted 2024-10-03 04:27:21

Use the following regex to match on the Beleidsdekkingsgraad string. The same idea works for BNames.

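import re
import requests

r = requests.get('https://www.pensioenfondstno.nl/overons/dekkingsgraad')
# Lazily capture each quoted string that starts with "Beleidsdekkingsgraad:"
p = re.compile(r'"(Beleidsdekkingsgraad:[\s\S]*?)"', re.MULTILINE)
# Take the last match and split it into the value part and the month part
data = p.findall(r.text)[-1].split(' <br/> ')
print(data[0])
print(data[1])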
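Applied to BNames, the same idea might look like the sketch below. It runs on a made-up excerpt instead of fetching the page, so page_text and its contents are assumptions for illustration only.

```python
import re

# Made-up excerpt standing in for r.text; the real page is not fetched here.
page_text = (
    '"name":"BNames","tooltip":""},'
    '"tooltip":"BNames: 114,60 % <br/> Month: oktober 2018"},'
    '"tooltip":"BNames: 108,50 % <br/> Month: september 2019"},'
    '"tooltip":"Mandatory: 104,10 %"},'
)

# Same pattern as above with the label swapped for BNames. The colon after
# the label keeps plain "name":"BNames" entries from matching.
p = re.compile(r'"(BNames:[\s\S]*?)"')
data = p.findall(page_text)[-1].split(' <br/> ')

print(data[0])
print(data[1])
```

Because empty tooltips never contain the BNames: label, the last match is the last populated entry.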

