How to generate a Python dict from JSON data using BeautifulSoup

Published 2024-09-29 20:31:00

I am scraping a website from which I need certain information. The information I'm after is:

<script type="text/javascript">
    Sw.preloadedData = {};
    Sw.preloadedData["overview"] = {"Title":"Facebook","Description":"A social utility that connects people, to keep up with friends, upload photos, share links and videos.","GlobalRank":[1,28115594,0],"Country":840,"CountryRanks":{"12":[1,830254,0],"818":[1,463162,0],"604":[1,599566,0],"608":[1,986465,0],"484":[1,1329484,0],"504":[1,672216,0],"862":[1,724854,0],"688":[1,534093,0],"702":[1,427637,0],"703":[1,341310,0],"756":[1,903074,0],"840":[1,5142062,0],"250":[1,1887449,0],"724":[2,1432992,0],"764":[1,857348,0],"76":[1,2733763,0],"784":[1,564929,0],"376":[1,390754,0],"792":[1,979507,0],"804":[8,1073943,1],"344":[1,284415,0],"348":[1,471458,0],"643":[8,1424933,1],"682":[1,692392,0],"380":[1,1441457,0],"392":[3,979893,1],"170":[1,1048409,0],"191":[1,348589,0],"620":[1,841554,0],"642":[1,814441,0],"356":[2,2356839,1],"528":[1,1092022,0],"616":[2,1430485,0],"360":[1,1541560,0],"372":[1,361215,0],"458":[1,851821,0],"36":[1,857177,0],"578":[1,349987,0],"586":[1,553155,0],"704":[2,918752,0],"710":[2,439567,1],"826":[1,2062694,0],"124":[1,1950051,0],"752":[1,577990,0],"300":[1,654931,0],"203":[1,623702,0],"208":[1,350294,0],"32":[1,1223765,0],"100":[1,473283,0],"554":[1,268216,0],"56":[1,1124680,0],"152":[1,725504,0],"156":[25,375144,-1],"158":[1,408462,0],"276":[1,2752131,0],"40":[1,700209,0],"410":[1,327519,0],"246":[1,387528,0]},"Category":"Internet_and_Telecom/Social_Network","CategoryRank":[1,27564,0],"TrafficReach":[0.32364475161620337,0.32385066912312122,0.32476481437213323,0.31948943452696626,0.310612833573507,0.30867420840432391,0.30666509584041279,0.31334128772658171,0.33551546090119239,0.3260064922555041,0.33396164810609369,0.33999592327084549,0.33711315799626795,0.32152719433964483,0.31986157880865085,0.32069766148623413,0.3306823871380894,0.32266565637788247,0.29034777869603251,0.29286953998372667,0.29969130766646174,0.3071060984450904,0.28517166164955293,0.29038329556338477,0.2845053957123595],"TrafficReachStart":1346457600,"TrafficReachEnd":1362096000,"Eng
agments":[{"Year":2012,"Month":9,"Reach":[0.32364839148251978,0.012621437484750864],"Time":[1225.8536260294338,0.00090266734593069664],"PPV":[21.312597646825566,0.034059623863791355],"Bounce":[0.18813037420762707,0.043481349041723627]},{"Year":2012,"Month":10,"Reach":[0.31325536305080282,-0.032112096661782052],"Time":[1308.5613956266043,0.0674695313053506],"PPV":[25.612224490959978,0.20174109770119109],"Bounce":[0.17672838267013638,-0.060606861520974054]},{"Year":2012,"Month":11,"Reach":[0.33350274816471975,0.064635398151613677],"Time":[1300.8263833937028,-0.0059110808699942563],"PPV":[24.020971463806184,-0.062128653749518592],"Bounce":[0.186024790640559,0.052602801145837264]},{"Year":2012,"Month":12,"Reach":[0.32441610872340648,-0.027246070658540122],"Time":[1331.3137947173564,0.023436956470790138],"PPV":[24.916914500937356,0.03729836815638965],"Bounce":[0.18107629094748873,-0.026601291559208651]},{"Year":2013,"Month":1,"Reach":[0.29998222452228729,-0.075316494909170029],"Time":[1334.5042854365543,0.0023964979044441836],"PPV":[25.52485794831804,0.024398825438752159],"Bounce":[0.18097482510209897,-0.00056034859593612207]},{"Year":2013,"Month":2,"Reach":[0.2842911869016958,-0.052306557982157109],"Time":[1281.8427161473487,-0.039461521303379321],"PPV":[23.201378273544368,-0.091028113828417134],"Bounce":[0.18673378186827794,0.031821866731629678]}],"TrafficSources":{"Search":0.12679771428369516,"Social":0.0095590714393366649,"Mail":0.018352638254343783,"Paid Referrals":0.0010665044954870533,"Direct":0.60148809501325917,"Referrals":0.24273597651387802},"RedirectUrl":"facebook.com"};
    Sw.period = { month:2 ,year:2013,period:6 };
    Sw.siteDomain = "Facebook.com";
    Sw.siteCategory = "Internet_and_Telecom/Social_Network";
    Sw.siteCountry = "840";

</script>

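If I select the script tag with BeautifulSoup, how can I get that (JSON?) dictionary as a Python dict?

First, I need to select just that JSON object. How do I do that?

Then I need to convert the JSON object into a Python dict.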


2 Answers

You need to do some text processing:

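import json

# scripttag is the <script> element already selected with BeautifulSoup.
# Pick the first line that contains the assignment to Sw.preloadedData["overview"].
scriptline = next((line for line in scripttag.string.splitlines()
    if 'Sw.preloadedData["overview"]' in line))
# Keep the text after the first '=', then strip spaces and the trailing ';'.
data = scriptline.split('=', 1)[1].strip(' ;')
data = json.loads(data)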

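The next(...) call selects the first line that contains Sw.preloadedData["overview"]. We then split that line once on =, take the remainder of the line, strip whitespace and the semicolon, and interpret the result as JSON. This assumes the whole assignment sits on a single line.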

This yields a Python dictionary holding the overview data; for example, data["Title"] is "Facebook".
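
The line-based approach above can be tried without a network request. The following is a minimal sketch, assuming a shortened, hypothetical script text in place of scripttag.string; the helper name extract_overview and the None return for a missing line are illustrative choices, not part of the original answer.

```python
import json

# Hypothetical stand-in for scripttag.string; the real page holds a much larger object.
script_text = '''
    Sw.preloadedData = {};
    Sw.preloadedData["overview"] = {"Title":"Facebook","Country":840,"GlobalRank":[1,28115594,0]};
    Sw.siteDomain = "Facebook.com";
'''


def extract_overview(text):
    """Return the dict assigned to Sw.preloadedData["overview"], or None if absent."""
    marker = 'Sw.preloadedData["overview"]'
    # The '' default avoids a StopIteration when no line contains the marker.
    line = next((candidate for candidate in text.splitlines() if marker in candidate), '')
    if not line:
        return None
    # Split once on the first '=', then drop surrounding spaces and the trailing ';'.
    payload = line.split('=', 1)[1].strip(' ;')
    return json.loads(payload)


overview = extract_overview(script_text)
print(overview["Title"])
```

Returning None instead of raising keeps the "assignment not found" case explicit for the caller; raising an exception would be an equally reasonable design.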

If your value definition spans multiple lines, we can use the JSONDecoder.raw_decode() method to make parsing that information easier:

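import json

# Keep everything after the variable name, then drop the spaces and '=' before the object.
script_rest = scripttag.string.split('Sw.preloadedData["overview"]', 1)[1].lstrip(' =')
decoder = json.JSONDecoder()
# raw_decode() returns the parsed value and the index at which parsing stopped.
data, _ = decoder.raw_decode(script_rest)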

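The raw_decode() call parses JSON even when there is trailing data, so it will try to find a complete JSON object starting right after the = that follows the Sw.preloadedData["overview"] text.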
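
To illustrate the trailing-data behaviour, here is a small sketch that assumes a hypothetical script text in which the object spans several lines. The sample values are made up for brevity; only json.JSONDecoder.raw_decode() is taken from the answer.

```python
import json

# Hypothetical script text in which the object spans several lines.
script_text = '''
    Sw.preloadedData = {};
    Sw.preloadedData["overview"] = {
        "Title": "Facebook",
        "TrafficSources": {"Search": 0.12, "Direct": 0.6}
    };
    Sw.siteDomain = "Facebook.com";
'''

marker = 'Sw.preloadedData["overview"]'
# Keep everything after the marker, then drop the spaces and '=' before the object.
rest = script_text.split(marker, 1)[1].lstrip(' =')

# raw_decode returns the parsed value and the index where parsing stopped,
# so the JavaScript that follows the object is simply ignored.
data, end = json.JSONDecoder().raw_decode(rest)

print(data["TrafficSources"]["Direct"])
print(repr(rest[end]))  # the character immediately after the parsed object
```

Note that raw_decode() does not skip leading whitespace, which is why the lstrip(' =') step matters.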

Something like:

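import re, json
# Here data is the text of the script tag; the pattern captures the rest of the line after the '='.
jsondata = json.loads(re.search(r'Sw\.preloadedData\["overview"\] = (.*)', data).group(1).rstrip(';'))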
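
Because . does not match newlines by default, the pattern above only captures to the end of the line. One possible variant, sketched here under the assumption that the object may be wrapped across lines, uses the regular expression only to locate the start of the value and hands the parsing to raw_decode(). The function name find_overview and the sample string are hypothetical.

```python
import json
import re


def find_overview(text):
    """Locate the assignment with a regex, then let the JSON decoder find the object's end."""
    # \s* also matches newlines, so the object may start on the next line.
    match = re.search(r'Sw\.preloadedData\["overview"\]\s*=\s*', text)
    if match is None:
        return None
    # raw_decode accepts a start index and ignores whatever follows the object.
    value, _ = json.JSONDecoder().raw_decode(text, match.end())
    return value


sample = (
    'Sw.preloadedData = {};\n'
    'Sw.preloadedData["overview"] =\n'
    '  {"Title": "Facebook", "Country": 840};\n'
    'Sw.siteCountry = "840";'
)
result = find_overview(sample)
print(result["Country"])
```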
