Unable to scrape a URL with Python Scrapy because it contains # (a URI fragment)

Posted on 2024-10-02 18:26:30


I am facing the same problem. Can anyone tell me how to scrape the URL below?

start_urls = [ 'https://onlinelibrary.ectrims-congress.eu/ectrims/#!*menu=6*browseby=3*sortby=2*media=3*ce_id=1428' ]

The response I get is:

Crawled (200) <GET https://onlinelibrary.ectrims-congress.eu/ectrims/?_escaped_fragment_=%2Amenu%3D6%2Abrowseby%3D3%2Asortby%3D2%2Amedia%3D3%2Ace_id%3D1428%3E> (referer: None) ['cached']

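Unfortunately, I cannot extract any data with response.xpath because it returns empty values. When I open the URL shown in the response, it does not appear to be the exact URL I want to get the data from.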
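For context, the rewritten URL in the log matches the old AJAX crawling scheme, in which a `#!` fragment is moved into an `_escaped_fragment_` query parameter; Scrapy appears to apply this to request URLs. The sketch below reproduces that rewriting with the standard library only. The function name `to_escaped_fragment` is hypothetical, and this is an illustration rather than Scrapy's own code.

```python
from urllib.parse import quote, urldefrag


def to_escaped_fragment(url: str) -> str:
    """Rewrite a '#!' (hashbang) URL into its '_escaped_fragment_' form."""
    base, fragment = urldefrag(url)
    if not fragment.startswith("!"):
        # Ordinary fragments are left untouched.
        return url
    separator = "&" if "?" in base else "?"
    # Percent-encode everything after the '!', including '*' and '='.
    return f"{base}{separator}_escaped_fragment_={quote(fragment[1:], safe='')}"


url = (
    "https://onlinelibrary.ectrims-congress.eu/ectrims/"
    "#!*menu=6*browseby=3*sortby=2*media=3*ce_id=1428"
)
rewritten = to_escaped_fragment(url)
print(rewritten)
```

The server only ever sees the part before the `#`, so the page that comes back does not contain the listing that the browser builds with JavaScript, which would explain the empty XPath results.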

Please help.


1 Answer
User
Answer 1 · Posted on 2024-10-02 18:26:30

Website

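Looking at the website, you can see that the content you want is driven by JavaScript, which makes it likely that the data is loaded from an API endpoint through AJAX requests. In Chrome DevTools, the XHR filter shows that 5 requests are made. Of these, the API https://onlinelibrary.ectrims-congress.eu/ectrims/listing/events/banners returns the data you need once you pass the required parameters: headers, cookies, and body.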
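The key-value pairs in the `#!` fragment of the original URL look like the same ones the page sends as form data. As a rough sketch under that assumption, the form body can be derived from the fragment with the standard library. The helper name `fragment_to_form_body` is hypothetical, and treating `getpage` as a page number is a guess based on its name.

```python
from urllib.parse import urldefrag, urlencode


def fragment_to_form_body(url: str, page: int = 1) -> str:
    """Turn '#!*key=value*key=value' into a urlencoded form body."""
    _, fragment = urldefrag(url)
    # Drop the leading '!' and split on '*', keeping only 'key=value' pieces.
    pieces = [p for p in fragment.lstrip("!").split("*") if "=" in p]
    params = dict(p.split("=", 1) for p in pieces)
    # Assumption: 'getpage' selects the page of results to return.
    params["getpage"] = str(page)
    return urlencode(params)


url = (
    "https://onlinelibrary.ectrims-congress.eu/ectrims/"
    "#!*menu=6*browseby=3*sortby=2*media=3*ce_id=1428"
)
body = fragment_to_form_body(url)
print(body)
```

Building the body this way keeps the spider in sync with whatever listing URL you copy from the browser, instead of hard-coding the string.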

Code

import scrapy  # needed because the spider subclasses scrapy.Spider
from scrapy import Request

class Ectrims(scrapy.Spider):
    name = 'library'

    headers = {
        "Connection": "keep-alive",
        "sec-ch-ua": "\" Not;A Brand\";v=\"99\", \"Google Chrome\";v=\"91\", \"Chromium\";v=\"91\"",
        "Accept": "application/json, text/javascript, */*; q=0.01",
        "X-Requested-With": "XMLHttpRequest",
        "sec-ch-ua-mobile": "?0",
        "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.114 Safari/537.36",
        "Content-Type": "application/x-www-form-urlencoded; charset=UTF-8",
        "Origin": "https://onlinelibrary.ectrims-congress.eu",
        "Sec-Fetch-Site": "same-origin",
        "Sec-Fetch-Mode": "cors",
        "Sec-Fetch-Dest": "empty",
        "Referer": "https://onlinelibrary.ectrims-congress.eu/ectrims/",
        "Accept-Language": "en-US,en;q=0.9"
    }

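    # Cookies copied from a browser session; values such as PHPSESSID expire,
    # so replace them with the ones from your own session.
    cookies = {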
        "PHPSESSID": "if994kqobo2l80nk1ki7q233i5",
        "_ga": "GA1.2.212877690.1624208120",
        "_gid": "GA1.2.291339791.1624208120",
        "intercom-id-aucjjau5": "bdcafc49-97d0-42fb-b61e-46c74cfed3b0",
        "cp_user_200": "{\"1\":0}"
    }

    body = 'menu=6&browseby=3&sortby=2&media=3&ce_id=1428&getpage=1'

    def start_requests(self):
        url = 'https://onlinelibrary.ectrims-congress.eu/ectrims/listing/events/banners'
        yield Request(url=url, method='POST', cookies=self.cookies, headers=self.headers, body=self.body, callback=self.parse)


    def parse(self, response):
        print(response.body)
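The `Accept` header above asks for JSON, so the body printed in `parse` is probably a JSON document, though the actual structure of the response is not shown here. A minimal sketch of decoding it defensively follows; the helper name `decode_json_body` and the sample payloads are made up for illustration.

```python
import json
from typing import Any, Optional


def decode_json_body(body: bytes) -> Optional[Any]:
    """Return the parsed JSON payload, or None if the body is not valid JSON."""
    try:
        return json.loads(body.decode("utf-8"))
    except (UnicodeDecodeError, json.JSONDecodeError):
        return None


# Made-up payloads, only to show the two outcomes.
print(decode_json_body(b'{"items": [1, 2]}'))
print(decode_json_body(b"<html></html>"))
```

Inside the spider this could be called as `decode_json_body(response.body)`; recent Scrapy versions also provide `response.json()` on text responses.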

If this helps, please upvote.
