Is there a way to delay my web scraper before it scrapes the page?

Posted 2024-10-03 13:27:37


Here is my function:

def clubList(url,yearCode):
    print(url + "/clubs" + yearCode)
    response = requests.get(url + "/clubs" + yearCode)
    time.sleep(10)
    content = response.content
    soup = BeautifulSoup(content, "html.parser")
    cluburl = []
    clubs = []
    ul = soup.find_all(
        "ul",
        attrs={
            "class": "block-list-5 block-list-3-m block-list-1-s block-list-1-xs block-list-padding dataContainer"
        },
    )
    u = str(ul)
    soup2 = BeautifulSoup(u, "html.parser")
    for i, tags in enumerate(soup2.find_all("a")):
        cluburl.append(url + str(tags.get("href")))
    for i in range(0, len(cluburl)):
        cluburl[i] = cluburl[i].replace("overview", "squad")
    return cluburl

I'm trying to scrape data from the Premier League website to build a statistics database for a data analysis project.

My current link trail looks like this:

https://www.premierleague.com -> https://www.premierleague.com/clubs -> https://www.premierleague.com/clubs?se=418

The "?se=418" is a code I append to the link to specify which season's stats I want to view; each season has its own unique code.

I pass "https://www.premierleague.com" as url and "?se=418" as the year code to my function, and it should return a list of links to each club's page for that particular season. However, it always returns the club links for the current season.

I noticed that when I visit the link https://www.premierleague.com/clubs?se=418 directly, it first loads the current season's clubs and then dynamically refreshes to show the right ones.

So I thought adding a time delay might work, but I suspect the page content is already fetched inside the requests.get call, and I'm not sure where a delay would have to go to achieve this.
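For context: a delay cannot help here, because requests downloads only the server's initial HTML and never executes the page's JavaScript, so response.content is fixed the moment the call returns. A minimal sketch, using a hypothetical stand-in for the network call (no real request is made), illustrating why sleeping after the fetch changes nothing:

```python
import time

# Hypothetical stand-in for requests.get: the body is a fixed byte
# string captured when the request completes; no JavaScript ever runs.
class FakeResponse:
    def __init__(self, content: bytes):
        self.content = content

def fake_get(url: str) -> FakeResponse:
    # The server returns only the initial HTML (current-season clubs);
    # the "?se=418" filter is applied later by browser-side JavaScript.
    return FakeResponse(b"<html>current-season clubs</html>")

resp = fake_get("https://www.premierleague.com/clubs?se=418")
before = resp.content
time.sleep(0.5)         # waiting after the fetch...
after = resp.content    # ...the downloaded bytes are unchanged
print(before == after)  # True
```

A time.sleep placed before requests.get is still useful for politeness and rate limiting, but it will never change what the server sends back.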

Also, here are all the modules that need to be imported to run this function:

import requests
from bs4 import BeautifulSoup
import pandas as pd
import locale
import time

locale.setlocale(locale.LC_ALL, "en_US.UTF8")

1 Answer

Posted 2024-10-03 13:27:37

When you apply the season filter, the page calls the following API:

GET https://footballapi.pulselive.com/football/teams

It requires the following HTTP headers to return data:

account: premierleague
origin: https://www.premierleague.com

The following example uses the API to get the list of clubs, then extracts the club id and club name to build the club URLs:

import requests

season = 418  # unique code for the desired season

# Query the same API the season filter calls, with the required headers
r = requests.get(
    "https://footballapi.pulselive.com/football/teams",
    params={
        "pageSize": 100,
        "compSeasons": season,
        "compCodeForActivePlayer": "null",
        "comps": 1,
        "altIds": "true",
        "page": 0,
    },
    headers={
        "account": "premierleague",
        "origin": "https://www.premierleague.com",
    },
)

data = r.json()
# Build squad-page URLs from each club's numeric id and hyphenated name
print([
    f'https://www.premierleague.com/clubs/{int(t["club"]["id"])}/{t["club"]["name"].replace(" ","-")}/squad'
    for t in data["content"]
])
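Once the JSON is in hand, the URL construction above is plain string formatting over data["content"]. A small self-contained check of that last step, using a hypothetical, trimmed sample of the JSON shape (the real response carries many more fields, and the club ids here are illustrative):

```python
# Hypothetical, trimmed sample of the API's JSON; note "id" arrives
# as a float, which is why the answer's code wraps it in int().
data = {
    "content": [
        {"club": {"id": 1.0, "name": "Arsenal"}},
        {"club": {"id": 2.0, "name": "Aston Villa"}},
    ]
}

# Same comprehension as in the answer: numeric id + hyphenated name
urls = [
    f'https://www.premierleague.com/clubs/{int(t["club"]["id"])}/{t["club"]["name"].replace(" ", "-")}/squad'
    for t in data["content"]
]
print(urls)
```

Spaces in club names become hyphens, matching the path style the site uses for club pages.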
