I am practicing a few scraping techniques on an example website that uses JSON. Take the following page as an example: http://www.charitystars.com/product/juve-chelsea-3-0-champions-league-jersey-autographed-by-giorgio-chiellini. Its source is at view-source:https://www.charitystars.com/product/juve-chelsea-3-0-champions-league-jersey-autographed-by-giorgio-chiellini
I would like to get the information on lines 388-396 of that source:
<script>
var js_data = {"first_time_bid":true,"yourbid":0,"product":{"id":55,"item_number":"P55","type":"PRODUCT","fixed":0,"price":1000,"tot_price":1000,"min_bid_value":1010,"currency":"EUR","raise_bid":10,"stamp_end":"2013-06-14 12:00:00","bids_number":12,"estimated_value":200,"extended_time":0,"url":"https:\/\/www.charitystars.com\/product\/juve-chelsea-3-0-champions-league-jersey-autographed-by-giorgio-chiellini","conversion_value":1,"eid":0,"user_has_bidded":false},"bid":{"id":323,"uid":126,"first_name":"Fabio","last_name":"Gastaldi","company_name":"","is_company":0,"title":"fab1","nationality":"IT","amount":1000,"max_amount":0,"table":"","stamp":1371166006,"real_stamp":"2013-06-14 01:26:46"}};
var p_currency = '€';
var conversion_value = '1';
var merch_items = [];
var gallery_items = [];
var inside_gala = false;
</script>
I would also like to save each quoted key (i.e. "id", "item_number", "type", ...) in a variable with the same name.
So far I have managed to run the following:
import csv
import json
import re
import time
from urllib.request import Request, urlopen  # Python 3 home of urllib2's Request and urlopen
import requests
from bs4 import BeautifulSoup
from pandas import DataFrame
url = "http://www.charitystars.com/product/juve-chelsea-3-0-champions-league-jersey-autographed-by-giorgio-chiellini"
hdr = {"User-Agent": "My Agent"}
req = Request(url, headers=hdr)  # send the custom User-Agent with the request
response = urlopen(req)
htmlSource = response.read()
soup = BeautifulSoup(htmlSource, "html.parser")  # name the parser explicitly
title = soup.find_all("span", {"itemprop": "name"})  # get the title
script_soup = soup.find_all("script")  # every <script> tag on the page
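For some reason, script_soup contains a lot of information I don't need. I believe the part I need is in script_soup[9], but I don't know how to access it (in an efficient way). I would really appreciate your help.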
If you can use the requests and lxml modules, you can use them for this (updated based on the OP's edit).
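The requests and lxml route mentioned above might look like the following minimal sketch. It is an illustration rather than the answer's original code: the helper names fetch_html and extract_js_data are hypothetical, and the sketch assumes the JSON object sits on a single line after "var js_data = ", as in the page source quoted in the question. The script tag is selected by its content instead of by its position on the page.

```python
import json

from lxml import html


def fetch_html(url):
    """Download a page (hypothetical helper); requests is imported here
    so the parsing helper below can also be used on saved HTML."""
    import requests
    response = requests.get(url, headers={"User-Agent": "My Agent"}, timeout=10)
    response.raise_for_status()
    return response.text


def extract_js_data(page_source):
    """Return the object assigned to `var js_data` as a Python dict."""
    tree = html.fromstring(page_source)
    # Pick the <script> whose text mentions js_data, wherever it is on the page.
    scripts = tree.xpath('//script[contains(text(), "var js_data")]/text()')
    if not scripts:
        raise ValueError("no <script> containing 'var js_data' was found")
    # Assumption: the JSON is on one line. Keep the rest of that line after
    # the assignment and drop the trailing semicolon.
    after_assignment = scripts[0].split("var js_data = ", 1)[1]
    raw = after_assignment.splitlines()[0].rstrip().rstrip(";")
    return json.loads(raw)
```

Calling extract_js_data(fetch_html(url)) with the product URL from the question should then return the dictionary, provided the page still has the layout shown above.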
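The data is indeed in script_soup[9]. The problem is that it is a JSON string hard-coded inside a script tag. You can get the string as plain text with script_soup[9].string, then extract the JSON string with split() (as in my example) or with a regex. Then use json.loads() to load the string as a Python dictionary.
The data is now stored in the variable data. You can parse it however you need, or load it into pandas with pd.DataFrame(data).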
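As a rough illustration of the regex variant and of working with the resulting dictionary, here is a small sketch. The script_text string is a shortened stand-in for script_soup[9].string, copied from the page source in the question, and flatten is a hypothetical helper. Since "id" appears under both "product" and "bid", the sketch prefixes nested keys with their parent key instead of creating one variable per key name.

```python
import json
import re

# Shortened stand-in for script_soup[9].string (assumption: same layout as the page).
script_text = """
var js_data = {"first_time_bid":true,"yourbid":0,"product":{"id":55,"item_number":"P55","type":"PRODUCT","price":1000,"currency":"EUR"},"bid":{"id":323,"first_name":"Fabio","amount":1000}};
var p_currency = '€';
"""

# Capture the braces between "var js_data = " and the ";" ending that line.
# "." does not match newlines, so the match stays on the js_data line.
match = re.search(r"var js_data = (\{.*\});", script_text)
if match is None:
    raise ValueError("js_data assignment not found")
data = json.loads(match.group(1))


def flatten(nested, prefix=""):
    """Turn {"product": {"id": 55}} into {"product_id": 55}."""
    flat = {}
    for key, value in nested.items():
        name = prefix + key
        if isinstance(value, dict):
            flat.update(flatten(value, name + "_"))
        else:
            flat[name] = value
    return flat


product = data["product"]  # nested access, e.g. product["item_number"]
row = flatten(data)        # one flat record, e.g. row["bid_amount"]
```

A flat record like row is usually easier to write out with csv.DictWriter or to place in a one-row DataFrame than the nested dictionary.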