Extracting array elements from HTML

Posted 2024-09-27 19:26:16


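I am using urlopen and beautifulsoup4 to fetch the contents of a web page. The page I am fetching generates some dynamic JavaScript blocks, and I want to extract the contents of an entire array from one of them.

The array has the following format: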

<script type="text/javascript">
var jobmap = {};
jobmap[0]= {jk:'929a2508c8bf2c9c',efccid: '28d4bd688c1e4e86',srcid:'4beb17a7fc4b64e2',cmpid:'be1c2a3db344744f',num:'0',srcname:'City of Oshawa',cmp:'City of Oshawa',cmpesc:'City of Oshawa',cmplnk:'/City-of-Oshawa-jobs-in-Ontario',loc:'Oshawa, ON',country:'CA',zip:'',city:'Oshawa',title:'Systems Analyst',locid:'da5ca33120fa5fe5',rd:'8i0xAbEkuWUhy6dasPEQkceDzWLtCZmZLj2Y-bGYlQI'};
jobmap[1]= {jk:'2d06bbaac441e7d2',efccid: 'beb412fe8b0feacc',srcid:'0a0f0bf6b7639c78',cmpid:'0c05d4e9f9f0206d',num:'1',srcname:'FGL Sports Ltd.',cmp:'FGL Sports Ltd.',cmpesc:'FGL Sports Ltd.',cmplnk:'/FGL-Sports-jobs-in-Ontario',loc:'Ontario',country:'CA',zip:'',city:'',title:'Decision Support Analyst',locid:'8b17acc5f001bdbf',rd:'v7_ZQyGHijdq7ng-cswbFDpj7KoE_Ia4YknbAcijYgE'};
</script>

The array contains an unknown number of elements. How can I extract the contents of the whole array and save them to a JSON object?


1 answer
User
#1 · posted 2024-09-27 19:26:16

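BeautifulSoup can only help with part of the problem: locating the script element that contains the desired objects. After that, you need either a JavaScript parser or a regular expression, for example: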

import json
import re
from bs4 import BeautifulSoup


data = """
<script type="text/javascript">
var jobmap = {};
jobmap[0]= {jk:'929a2508c8bf2c9c',efccid: '28d4bd688c1e4e86',srcid:'4beb17a7fc4b64e2',cmpid:'be1c2a3db344744f',num:'0',srcname:'City of Oshawa',cmp:'City of Oshawa',cmpesc:'City of Oshawa',cmplnk:'/City-of-Oshawa-jobs-in-Ontario',loc:'Oshawa, ON',country:'CA',zip:'',city:'Oshawa',title:'Systems Analyst',locid:'da5ca33120fa5fe5',rd:'8i0xAbEkuWUhy6dasPEQkceDzWLtCZmZLj2Y-bGYlQI'};
jobmap[1]= {jk:'2d06bbaac441e7d2',efccid: 'beb412fe8b0feacc',srcid:'0a0f0bf6b7639c78',cmpid:'0c05d4e9f9f0206d',num:'1',srcname:'FGL Sports Ltd.',cmp:'FGL Sports Ltd.',cmpesc:'FGL Sports Ltd.',cmplnk:'/FGL-Sports-jobs-in-Ontario',loc:'Ontario',country:'CA',zip:'',city:'',title:'Decision Support Analyst',locid:'8b17acc5f001bdbf',rd:'v7_ZQyGHijdq7ng-cswbFDpj7KoE_Ia4YknbAcijYgE'};
</script>"""

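soup = BeautifulSoup(data, "html.parser")
# Locate the <script> element whose text defines jobmap; skip scripts with no text.
script = soup.find("script", string=lambda text: text and "var jobmap" in text)

# Capture the object literal on the right-hand side of each jobmap[i] assignment.
pattern = re.compile(r"jobmap\[\d+\]\s*=\s*({.*?})")
# A compiled pattern's findall() takes (string, pos, endpos), not flags.
for item in pattern.findall(script.string):
    print(item)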

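Prints:

{jk:'929a2508c8bf2c9c',efccid: '28d4bd688c1e4e86',srcid:'4beb17a7fc4b64e2',cmpid:'be1c2a3db344744f',num:'0',srcname:'City of Oshawa',cmp:'City of Oshawa',cmpesc:'City of Oshawa',cmplnk:'/City-of-Oshawa-jobs-in-Ontario',loc:'Oshawa, ON',country:'CA',zip:'',city:'Oshawa',title:'Systems Analyst',locid:'da5ca33120fa5fe5',rd:'8i0xAbEkuWUhy6dasPEQkceDzWLtCZmZLj2Y-bGYlQI'}
{jk:'2d06bbaac441e7d2',efccid: 'beb412fe8b0feacc',srcid:'0a0f0bf6b7639c78',cmpid:'0c05d4e9f9f0206d',num:'1',srcname:'FGL Sports Ltd.',cmp:'FGL Sports Ltd.',cmpesc:'FGL Sports Ltd.',cmplnk:'/FGL-Sports-jobs-in-Ontario',loc:'Ontario',country:'CA',zip:'',city:'',title:'Decision Support Analyst',locid:'8b17acc5f001bdbf',rd:'v7_ZQyGHijdq7ng-cswbFDpj7KoE_Ia4YknbAcijYgE'}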

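Note that each item is a JavaScript object literal rather than valid JSON (the keys are unquoted and the strings use single quotes), so json.loads() cannot load it directly. Consider using a JavaScript parser or some other means to convert the JavaScript object string into a Python dictionary.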
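One possible way to do that conversion with only the standard library is sketched below. It assumes the format shown in the question: every key is a bare identifier, every value is a single-quoted string, and no value contains the sequence `};`. The helper name `parse_jobmap` and the shortened sample data are illustrative, not part of the original answer.

```python
import json
import re

# One jobmap[i] assignment: capture the index and the object literal.
ENTRY = re.compile(r"jobmap\[(\d+)\]\s*=\s*(\{.*?\});")
# One key:'value' pair. The value may contain commas or colons;
# backslash-escaped characters are kept verbatim (not unescaped).
PAIR = re.compile(r"(\w+)\s*:\s*'((?:[^'\\]|\\.)*)'")


def parse_jobmap(script_text):
    """Return a list of dicts, one per jobmap[i] assignment, ordered by index."""
    entries = []
    for index, body in ENTRY.findall(script_text):
        entries.append((int(index), dict(PAIR.findall(body))))
    entries.sort(key=lambda entry: entry[0])
    return [record for _, record in entries]


# Shortened sample in the same format as the question's script block.
sample = """
var jobmap = {};
jobmap[0]= {jk:'929a2508c8bf2c9c',cmp:'City of Oshawa',loc:'Oshawa, ON',zip:''};
jobmap[1]= {jk:'2d06bbaac441e7d2',cmp:'FGL Sports Ltd.',loc:'Ontario',zip:''};
"""

jobs = parse_jobmap(sample)
print(json.dumps(jobs, indent=2))
```

In the answer's code, the text of the located script element (`script.string`) would be passed in place of `sample`. Because the pattern matches every `jobmap[i]` assignment it finds, the number of elements does not need to be known in advance. For scripts with nested objects, numeric values, or escaped quotes, a real JavaScript parser is likely the safer choice.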
