Scraping JSON with BeautifulSoup and urllib

Published 2024-10-03 11:14:47


I'm teaching myself some scraping techniques on a sample site that embeds JSON. Take this page as an example: http://www.charitystars.com/product/juve-chelsea-3-0-champions-league-jersey-autographed-by-giorgio-chiellini. Its source is at view-source:https://www.charitystars.com/product/juve-chelsea-3-0-champions-league-jersey-autographed-by-giorgio-chiellini. I want to extract the information on lines 388-396:

<script>
    var js_data = {"first_time_bid":true,"yourbid":0,"product":{"id":55,"item_number":"P55","type":"PRODUCT","fixed":0,"price":1000,"tot_price":1000,"min_bid_value":1010,"currency":"EUR","raise_bid":10,"stamp_end":"2013-06-14 12:00:00","bids_number":12,"estimated_value":200,"extended_time":0,"url":"https:\/\/www.charitystars.com\/product\/juve-chelsea-3-0-champions-league-jersey-autographed-by-giorgio-chiellini","conversion_value":1,"eid":0,"user_has_bidded":false},"bid":{"id":323,"uid":126,"first_name":"Fabio","last_name":"Gastaldi","company_name":"","is_company":0,"title":"fab1","nationality":"IT","amount":1000,"max_amount":0,"table":"","stamp":1371166006,"real_stamp":"2013-06-14 01:26:46"}};
    var p_currency = '€';
    var conversion_value = '1';
    var merch_items = [];
    var gallery_items = [];

    var inside_gala = false;
</script>

and save each quoted field (i.e. "id", "item_number", "type", …) in a variable with the same name.

So far I have managed to run the following:

import urllib2  # Python 2; on Python 3 use urllib.request instead
from bs4 import BeautifulSoup

hdr = {"User-Agent": "My Agent"}

url = "http://www.charitystars.com/product/juve-chelsea-3-0-champions-league-jersey-autographed-by-giorgio-chiellini"
req = urllib2.Request(url, headers=hdr)  # the URL must be a quoted string
response = urllib2.urlopen(req)

htmlSource = response.read()
soup = BeautifulSoup(htmlSource, "html.parser")  # name the parser explicitly

title = soup.find_all("span", {"itemprop": "name"})  # get the title

script_soup = soup.find_all("script")

For some reason script_soup contains a lot of information I don't need. I believe the part I want is in script_soup[9], but I don't know how to access it (in an efficient way). I'd really appreciate your help.
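For reference, one way to avoid hunting for the right index is to search the script tags by their text rather than by position; a minimal sketch, where the html literal is a trimmed stand-in for the real page source:

```python
# A sketch: find the <script> tag whose text mentions js_data, instead
# of relying on its position in find_all("script"). The html string
# below is a trimmed stand-in for the real page source.
import re

from bs4 import BeautifulSoup

html = '<script>var x = 1;</script><script>var js_data = {"yourbid": 0};</script>'
soup = BeautifulSoup(html, "html.parser")

# find() accepts a compiled regex that is matched against each tag's string
tag = soup.find("script", string=re.compile("js_data"))
print(tag.string)
```

This keeps working even if the site inserts or removes other script tags and the index 9 shifts.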


Tags: import, com, by, var, www, script, product, soup
2 Answers

If you can use the requests and lxml modules, you can do it like this.

Updated per the OP

import requests
from lxml import html
import json

header = {
    'User-Agent': ('Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko)'
                   ' Chrome/85.0.4183.102 Safari/537.36 Edg/85.0.564.51'),
    'X-Requested-With': 'XMLHttpRequest'
}


url = 'http://www.charitystars.com/product/juve-chelsea-3-0-champions-league-jersey-autographed-by-giorgio-chiellini'
a = requests.get(url, headers=header)

# grab the text of the <script> tag inside the page-content block
a = html.fromstring(a.text).xpath('//*[@class="page-content"]/script/text()')[0]
a = a.replace('\n', '').replace(' ', '')  # strip all whitespace
b = a.split(';')                          # one JS statement per element
b = [i.split('=') for i in b]             # split each into name / value
c = json.loads(b[0][1])                   # first statement is the js_data literal
c['product']
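As for the question's goal of saving each field ("id", "item_number", …) under a variable of the same name: rather than creating variables dynamically, a SimpleNamespace gives you attributes with those names. A minimal sketch, using a trimmed copy of the page's "product" object:

```python
# A sketch, using a trimmed copy of the "product" object from the page.
import json
from types import SimpleNamespace

js_data = json.loads('{"product": {"id": 55, "item_number": "P55", "price": 1000}}')

# SimpleNamespace turns each key into an attribute of the same name,
# avoiding dynamic creation of module-level variables.
product = SimpleNamespace(**js_data["product"])
print(product.id, product.item_number, product.price)  # -> 55 P55 1000
```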

The data is indeed in script_soup[9]. The problem is that it is a JSON string hard-coded inside a script tag. You can get that tag's content as plain text with script_soup[9].string, then extract the JSON string with split() (as in my example below) or with a regex. Then load the string as a Python dictionary with json.loads().

import requests
from bs4 import BeautifulSoup
from pandas import DataFrame
import json

hdr = {"User-Agent": "My Agent"}
response = requests.get("http://www.charitystars.com/product/juve-chelsea-3-0-champions-league-jersey-autographed-by-giorgio-chiellini", headers=hdr)

soup = BeautifulSoup(response.content, "html.parser")
script_soup = soup.find_all("script")
data = json.loads(script_soup[9].string.split('= ')[1].split(';')[0])

The data is now stored in the variable data. You can parse it however you need, or load it into pandas with DataFrame(data).
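The regex route mentioned above can be sketched like this; the script_text literal stands in for script_soup[9].string from the real page:

```python
import json
import re

# Stand-in for script_soup[9].string from the real page.
script_text = 'var js_data = {"first_time_bid": true, "yourbid": 0};\nvar p_currency = "EUR";'

# Capture the object literal assigned to js_data, up to the closing "};".
match = re.search(r'var\s+js_data\s*=\s*(\{.*?\});', script_text, re.DOTALL)
data = json.loads(match.group(1))
print(data)  # -> {'first_time_bid': True, 'yourbid': 0}
```

This is a bit more robust than chained split() calls if the surrounding script ever changes its spacing around the `=` sign.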
