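I am trying to replicate the request the site makes when a filter is applied to the Premier League player stats. I noticed that the URL gains the component '?co=1&se=274' when the 2019/20 season filter is used: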
https://www.premierleague.com//players/5140/Virgil-van-Dijk/stats?co=1&se=274
instead of
https://www.premierleague.com//players/5140/Virgil-van-Dijk/stats
But when I use
requests.get('https://www.premierleague.com//players/5140/Virgil-van-Dijk/stats?co=1&se=274')
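to scrape the content, the page comes back as if the filter had not been applied. How can I apply the filter that the web page uses?
After digging a little deeper, I learned that the site is served through CloudFront, which apparently strips all query parameters before the request is forwarded. Is there a way around this?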
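To make the stripping claim concrete, here is a minimal sketch that splits the filtered URL into the part that would still reach the server and the filter parameters that would be dropped. The helper name `split_filter` is hypothetical, and the sketch only illustrates the idea; it does not prove what CloudFront actually does.

```python
from urllib.parse import parse_qs, urlsplit, urlunsplit


def split_filter(url):
    """Return (url_without_query, filter_params) for a stats page URL."""
    parts = urlsplit(url)
    # Rebuild the URL with the query string and fragment removed.
    base = urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))
    # parse_qs returns a list per key; keep the first value of each.
    filters = {key: values[0] for key, values in parse_qs(parts.query).items()}
    return base, filters


filtered = "https://www.premierleague.com//players/5140/Virgil-van-Dijk/stats?co=1&se=274"
base, filters = split_filter(filtered)
```

If the query string really is dropped, both URLs from the question reduce to the same `base`, which would explain why the scraped page shows career totals.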
Here is how I am scraping the data:
import time
from pprint import pprint

import requests
from bs4 import BeautifulSoup as soup
from tqdm import tqdm

players_url = ['https://www.premierleague.com//players/5140/Virgil-van-Dijk/stats?co=1&se=274']

# this is the dict where we store all information:
players = {}

for i in tqdm(players_url):
    player_page = requests.get(i)
    cont = soup(player_page.content, 'lxml')
    time.sleep(2)  # pause between requests
    # map each stat label to its value
    data = dict((k.contents[0].strip(), v.get_text(strip=True)) for k, v in zip(cont.select('.topStat span.stat, .normalStat span.stat'), cont.select('.topStat span.stat > span, .normalStat span.stat > span')))
    # the first 'info' div holds the club, the next one the position
    clud_ele = cont.find('div', attrs={'class' : 'info'})
    club = {"Club" : clud_ele.get_text(strip=True)}
    position = {"Position": clud_ele.find_next('div', attrs={'class' : 'info'}).get_text(strip=True)}
    data.update(club)
    data.update(position)
    players[cont.select_one('.playerDetails .name').get_text(strip=True)] = data

pprint(players)
In the output I can clearly see that the filter was not applied, because there have not been 45 matches this season:
{'Virgil van Dijk': {'Accurate long balls': '533',
'Aerial battles lost': '207',
'Aerial battles won': '589',
'Appearances': '122',
'Assists': '2',
'Big chances created': '11',
'Blocked shots': '23',
'Clean sheets': '45',
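You can get around this by replicating the background requests that the page makes when you filter by season. I used the requests library to fetch the stats for all players. The process mainly involves three URLs: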
1. Getting the season ID values (ex. 274):
https://footballapi.pulselive.com/football/competitions/1/compseasons?page=0&pageSize=100
2. Getting the player names and IDs (ex. Name: Virgil van Dijk, ID: 5140):
https://footballapi.pulselive.com/football/players
3. Getting the player stats using the player ID (ex. 5140):
https://footballapi.pulselive.com/football/stats/player/
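The three steps above can be sketched roughly as follows. This is not the original script, only a minimal outline under several assumptions: the query parameter names `comps` and `compSeasons`, the paging parameters on the players endpoint, and the `Origin` header are guesses that should be checked against the requests shown in the browser's network tab. All function names are hypothetical, and `session` is expected to be a `requests.Session()` (or anything with a compatible `get` method).

```python
from typing import Any, Dict, Optional, Tuple

API_ROOT = "https://footballapi.pulselive.com/football"

# Assumption: the API expects calls that look like they come from the site.
HEADERS = {"Origin": "https://www.premierleague.com"}


def get_json(session: Any, url: str, params: Optional[Dict[str, Any]] = None) -> Any:
    """GET `url` through `session` and decode the JSON body."""
    response = session.get(url, params=params, headers=HEADERS, timeout=30)
    response.raise_for_status()
    return response.json()


def seasons_request(competition_id: int = 1) -> Tuple[str, Dict[str, Any]]:
    """Step 1: URL and parameters listing season IDs (ex. 274)."""
    url = f"{API_ROOT}/competitions/{competition_id}/compseasons"
    return url, {"page": 0, "pageSize": 100}


def players_request(season_id: int, page: int = 0) -> Tuple[str, Dict[str, Any]]:
    """Step 2: URL and parameters listing player names and IDs.

    Assumption: paging and season parameters are named as below.
    """
    url = f"{API_ROOT}/players"
    return url, {"page": page, "pageSize": 100, "compSeasons": season_id}


def player_stats_request(player_id: int, season_id: int,
                         competition_id: int = 1) -> Tuple[str, Dict[str, Any]]:
    """Step 3: URL and parameters for one player's stats in one season.

    Assumption: `comps` and `compSeasons` correspond to `co` and `se`
    in the website URL.
    """
    url = f"{API_ROOT}/stats/player/{player_id}"
    return url, {"comps": competition_id, "compSeasons": season_id}


def fetch_player_stats(session: Any, player_id: int, season_id: int) -> Any:
    """Fetch the season-filtered stats for a single player."""
    url, params = player_stats_request(player_id, season_id)
    return get_json(session, url, params)
```

With `import requests`, a call such as `fetch_player_stats(requests.Session(), 5140, 274)` would then request Virgil van Dijk's stats for the 2019/20 season, provided the assumed parameter names are right. Passing the session in keeps the helpers easy to test without touching the network.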
Full script
Sample output
The data.json file contains the data for all players.