Can't get Scrapy to parse and follow 301/302 redirects



I'm trying to write a very simple website crawler that lists URLs together with their referrers and status codes for 200, 301, 302 and 404 HTTP responses.

It turns out Scrapy works great for this: my script uses it to crawl the website correctly and lists URLs with 200 and 404 status codes without any problem.

The problem: I can't figure out how to get Scrapy to both follow redirects and parse/output them. I can get one or the other to work, but not both at the same time.

What I've tried so far:

  • Setting meta={'dont_redirect': True} and setting REDIRECT_ENABLED = False (see the sketch after this list)

  • Adding 301, 302 to handle_httpstatus_list

  • Changing the settings described in the RedirectMiddleware documentation

  • Reading the RedirectMiddleware source code for insight

  • Various combinations of the above

  • Other random stuff
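
For context, here is a minimal sketch of roughly what those attempts look like in code; the spider name, URL and exact combination of options are placeholders, not the actual contents of the repo:

import scrapy


class StatusSpider(scrapy.Spider):
    # hypothetical example spider, not the asker's actual code
    name = "status"
    # let 301/302/404 responses reach parse() instead of being filtered out
    handle_httpstatus_list = [301, 302, 404]
    # disable the redirect middleware globally...
    custom_settings = {"REDIRECT_ENABLED": False}

    def start_requests(self):
        # ...or per request, via the dont_redirect meta key
        yield scrapy.Request(
            "https://example.com/",
            meta={"dont_redirect": True},
            callback=self.parse,
        )

    def parse(self, response):
        # with redirects disabled, 3xx statuses show up here but are never followed
        self.logger.info("%s %d (referer: %s)", response.url, response.status,
                         response.request.headers.get("Referer"))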

Here is the public repo if you want to look at the code.


1 Answer

If you want to parse 301 and 302 responses and follow them at the same time, ask your callback to handle 301 and 302 and then mimic what RedirectMiddleware does.

Test 1 (not working)

Let's illustrate with a simple spider (which doesn't do what you want yet):

import scrapy


class HandleSpider(scrapy.Spider):
    name = "handle"
    start_urls = (
        'https://httpbin.org/get',
        'https://httpbin.org/redirect-to?url=http%3A%2F%2Fexample.com%2F',
    )
    def parse(self, response):
        self.logger.info("got response for %r" % response.url)

Right now, the spider requests two pages, and the second one should redirect to http://example.com/:

$ scrapy runspider test.py
2016-09-30 11:28:17 [scrapy] INFO: Scrapy 1.1.3 started (bot: scrapybot)
2016-09-30 11:28:18 [scrapy] DEBUG: Crawled (200) <GET https://httpbin.org/get> (referer: None)
2016-09-30 11:28:18 [scrapy] DEBUG: Redirecting (302) to <GET http://example.com/> from <GET https://httpbin.org/redirect-to?url=http%3A%2F%2Fexample.com%2F>
2016-09-30 11:28:18 [handle] INFO: got response for 'https://httpbin.org/get'
2016-09-30 11:28:18 [scrapy] DEBUG: Crawled (200) <GET http://example.com/> (referer: None)
2016-09-30 11:28:18 [handle] INFO: got response for 'http://example.com/'
2016-09-30 11:28:18 [scrapy] INFO: Spider closed (finished)

The 302 was handled automatically by RedirectMiddleware and was never passed to your callback.
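
As an aside: if all you need is the list of URLs a request was redirected through (rather than the 3xx responses themselves), you can keep RedirectMiddleware enabled and read the redirect_urls list it records in request.meta. A minimal sketch, assuming the default middleware stack (the spider name is made up):

import scrapy


class ChainSpider(scrapy.Spider):
    name = "chain"
    start_urls = (
        'https://httpbin.org/redirect-to?url=http%3A%2F%2Fexample.com%2F',
    )

    def parse(self, response):
        # RedirectMiddleware stores the URLs we were redirected from, oldest first
        for url in response.request.meta.get('redirect_urls', []):
            self.logger.info("redirected from %r", url)
        self.logger.info("final response %d for %r", response.status, response.url)

This still doesn't hand the 301/302 responses (and their status codes) to your callback, which is what the next tests work towards.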

Test 2 (still not quite right)

Let's configure the spider to handle 301 and 302 in its callback, using handle_httpstatus_list:

import scrapy


class HandleSpider(scrapy.Spider):
    name = "handle"
    start_urls = (
        'https://httpbin.org/get',
        'https://httpbin.org/redirect-to?url=http%3A%2F%2Fexample.com%2F',
    )
    handle_httpstatus_list = [301, 302]
    def parse(self, response):
        self.logger.info("got response %d for %r" % (response.status, response.url))

Let's run it:

$ scrapy runspider test.py
2016-09-30 11:33:32 [scrapy] INFO: Scrapy 1.1.3 started (bot: scrapybot)
2016-09-30 11:33:32 [scrapy] DEBUG: Crawled (200) <GET https://httpbin.org/get> (referer: None)
2016-09-30 11:33:32 [scrapy] DEBUG: Crawled (302) <GET https://httpbin.org/redirect-to?url=http%3A%2F%2Fexample.com%2F> (referer: None)
2016-09-30 11:33:33 [handle] INFO: got response 200 for 'https://httpbin.org/get'
2016-09-30 11:33:33 [handle] INFO: got response 302 for 'https://httpbin.org/redirect-to?url=http%3A%2F%2Fexample.com%2F'
2016-09-30 11:33:33 [scrapy] INFO: Spider closed (finished)

Here, we're missing the redirection: because 302 is now in the spider's handle_httpstatus_list, RedirectMiddleware leaves the response alone, so nothing follows the Location header to http://example.com/.

Test 3 (working)

Let's do the same as RedirectMiddleware, but in the spider callback:

from six.moves.urllib.parse import urljoin

import scrapy
from scrapy.utils.python import to_native_str


class HandleSpider(scrapy.Spider):
    name = "handle"
    start_urls = (
        'https://httpbin.org/get',
        'https://httpbin.org/redirect-to?url=http%3A%2F%2Fexample.com%2F',
    )
    handle_httpstatus_list = [301, 302]
    def parse(self, response):
        self.logger.info("got response %d for %r" % (response.status, response.url))

        # do something with the response here...

        # handle redirection
        # this is copied/adapted from RedirectMiddleware
        if response.status >= 300 and response.status < 400:

            # HTTP header is ascii or latin1, redirected url will be percent-encoded utf-8
            location = to_native_str(response.headers['location'].decode('latin1'))

            # get the original request
            request = response.request
            # and the URL we got redirected to
            redirected_url = urljoin(request.url, location)

            if response.status in (301, 307) or request.method == 'HEAD':
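                # 301/307 (and HEAD requests) are re-issued with the same method and body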
                redirected = request.replace(url=redirected_url)
                yield redirected
            else:
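                # other 3xx codes are re-issued as GET without a body, as browsers do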
                redirected = request.replace(url=redirected_url, method='GET', body='')
                redirected.headers.pop('Content-Type', None)
                redirected.headers.pop('Content-Length', None)
                yield redirected

Let's run the spider again:

$ scrapy runspider test.py
2016-09-30 11:45:20 [scrapy] INFO: Scrapy 1.1.3 started (bot: scrapybot)
2016-09-30 11:45:21 [scrapy] DEBUG: Crawled (302) <GET https://httpbin.org/redirect-to?url=http%3A%2F%2Fexample.com%2F> (referer: None)
2016-09-30 11:45:21 [scrapy] DEBUG: Crawled (200) <GET https://httpbin.org/get> (referer: None)
2016-09-30 11:45:21 [handle] INFO: got response 302 for 'https://httpbin.org/redirect-to?url=http%3A%2F%2Fexample.com%2F'
2016-09-30 11:45:21 [handle] INFO: got response 200 for 'https://httpbin.org/get'
2016-09-30 11:45:21 [scrapy] DEBUG: Crawled (200) <GET http://example.com/> (referer: https://httpbin.org/redirect-to?url=http%3A%2F%2Fexample.com%2F)
2016-09-30 11:45:21 [handle] INFO: got response 200 for 'http://example.com/'
2016-09-30 11:45:21 [scrapy] INFO: Spider closed (finished)

We get redirected to http://example.com/ and we also get the response in our callback.
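
If the end goal is to output the URL, referrer and status code of every response, including the 301/302 ones, the parse() callback from Test 3 can also yield an item before re-issuing the redirected request. A minimal sketch of that addition (the item field names are just examples):

    def parse(self, response):
        # one record per response, 301/302 included thanks to handle_httpstatus_list
        yield {
            'url': response.url,
            'status': response.status,
            'referer': response.request.headers.get('Referer', b'').decode() or None,
        }
        # ...then fall through to the redirect handling shown in Test 3

Exporting the crawl with e.g. scrapy runspider test.py -o output.csv then produces the URL/referrer/status listing the question asks for.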
