Scraping JSON with BeautifulSoup and urllib

Published 2024-10-03 11:14:47


I'm teaching myself some scraping techniques on a sample site that embeds JSON. Take this page as an example: http://www.charitystars.com/product/juve-chelsea-3-0-champions-league-jersey-autographed-by-giorgio-chiellini. Its source is at view-source:https://www.charitystars.com/product/juve-chelsea-3-0-champions-league-jersey-autographed-by-giorgio-chiellini. I want to extract the information on lines 388-396:

<script>
    var js_data = {"first_time_bid":true,"yourbid":0,"product":{"id":55,"item_number":"P55","type":"PRODUCT","fixed":0,"price":1000,"tot_price":1000,"min_bid_value":1010,"currency":"EUR","raise_bid":10,"stamp_end":"2013-06-14 12:00:00","bids_number":12,"estimated_value":200,"extended_time":0,"url":"https:\/\/www.charitystars.com\/product\/juve-chelsea-3-0-champions-league-jersey-autographed-by-giorgio-chiellini","conversion_value":1,"eid":0,"user_has_bidded":false},"bid":{"id":323,"uid":126,"first_name":"Fabio","last_name":"Gastaldi","company_name":"","is_company":0,"title":"fab1","nationality":"IT","amount":1000,"max_amount":0,"table":"","stamp":1371166006,"real_stamp":"2013-06-14 01:26:46"}};
    var p_currency = '€';
    var conversion_value = '1';
    var merch_items = [];
    var gallery_items = [];

    var inside_gala = false;
</script>

and save each quoted field (i.e. "id", "item_number", "type", …) in a variable with the same name.

So far I have managed to run the following:

import urllib2  # Python 2; on Python 3 use urllib.request instead
from bs4 import BeautifulSoup

hdr = {"User-Agent": "My Agent"}

url = "http://www.charitystars.com/product/juve-chelsea-3-0-champions-league-jersey-autographed-by-giorgio-chiellini"
req = urllib2.Request(url, headers=hdr)  # the URL must be a quoted string
response = urllib2.urlopen(req)

htmlSource = response.read()
soup = BeautifulSoup(htmlSource, "html.parser")  # name the parser explicitly

title = soup.find_all("span", {"itemprop": "name"})  # get the title

script_soup = soup.find_all("script")

For some reason script_soup contains a lot of information I don't need. I believe the part I want is in script_soup[9], but I don't know how to access it (in an efficient way). I'd really appreciate your help.
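For reference, one way to avoid hunting for the right index is to search the script tags by their text rather than by position; a minimal sketch, where the html literal is a trimmed stand-in for the real page source:

```python
# A sketch: find the <script> tag whose text mentions js_data, instead
# of relying on its position in find_all("script"). The html string
# below is a trimmed stand-in for the real page source.
import re

from bs4 import BeautifulSoup

html = '<script>var x = 1;</script><script>var js_data = {"yourbid": 0};</script>'
soup = BeautifulSoup(html, "html.parser")

# find() accepts a compiled regex that is matched against each tag's string
tag = soup.find("script", string=re.compile("js_data"))
print(tag.string)
```

This keeps working even if the site inserts or removes other script tags and the index 9 shifts.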


Tags: import, com, by, var, www, script, product, soup
2 Answers

If you can use the requests and lxml modules, you can do it like this.

Updated per the OP

import requests
from lxml import html
import json

header = {
    'User-Agent': ('Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko)'
                   ' Chrome/85.0.4183.102 Safari/537.36 Edg/85.0.564.51'),
    'X-Requested-With': 'XMLHttpRequest'
}


url = 'http://www.charitystars.com/product/juve-chelsea-3-0-champions-league-jersey-autographed-by-giorgio-chiellini'
a = requests.get(url, headers=header)

# grab the text of the <script> tag inside the page-content block
a = html.fromstring(a.text).xpath('//*[@class="page-content"]/script/text()')[0]
a = a.replace('\n', '').replace(' ', '')  # strip all whitespace
b = a.split(';')                          # one JS statement per element
b = [i.split('=') for i in b]             # split each into name / value
c = json.loads(b[0][1])                   # first statement is the js_data literal
c['product']
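As for the question's goal of saving each field ("id", "item_number", …) under a variable of the same name: rather than creating variables dynamically, a SimpleNamespace gives you attributes with those names. A minimal sketch, using a trimmed copy of the page's "product" object:

```python
# A sketch, using a trimmed copy of the "product" object from the page.
import json
from types import SimpleNamespace

js_data = json.loads('{"product": {"id": 55, "item_number": "P55", "price": 1000}}')

# SimpleNamespace turns each key into an attribute of the same name,
# avoiding dynamic creation of module-level variables.
product = SimpleNamespace(**js_data["product"])
print(product.id, product.item_number, product.price)  # -> 55 P55 1000
```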

The data is indeed in script_soup[9]. The problem is that it is a JSON string hard-coded inside a script tag. You can get that tag's content as plain text with script_soup[9].string, then extract the JSON string with split() (as in my example below) or with a regex. Then load the string as a Python dictionary with json.loads().

import requests
from bs4 import BeautifulSoup
from pandas import DataFrame
import json

hdr = {"User-Agent": "My Agent"}
response = requests.get("http://www.charitystars.com/product/juve-chelsea-3-0-champions-league-jersey-autographed-by-giorgio-chiellini", headers=hdr)

soup = BeautifulSoup(response.content, "html.parser")
script_soup = soup.find_all("script")
data = json.loads(script_soup[9].string.split('= ')[1].split(';')[0])

The data is now stored in the variable data. You can parse it however you need, or load it into pandas with DataFrame(data).
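The regex route mentioned above can be sketched like this; the script_text literal stands in for script_soup[9].string from the real page:

```python
import json
import re

# Stand-in for script_soup[9].string from the real page.
script_text = 'var js_data = {"first_time_bid": true, "yourbid": 0};\nvar p_currency = "EUR";'

# Capture the object literal assigned to js_data, up to the closing "};".
match = re.search(r'var\s+js_data\s*=\s*(\{.*?\});', script_text, re.DOTALL)
data = json.loads(match.group(1))
print(data)  # -> {'first_time_bid': True, 'yourbid': 0}
```

This is a bit more robust than chained split() calls if the surrounding script ever changes its spacing around the `=` sign.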
