Python Scrapy: a MIME-type-based filter to avoid downloading non-text files

3 Answers

User

Reply 1 · Edited 2024-10-01 07:43:04

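The solution is to set up a Node.js proxy and configure Scrapy to use it through the http_proxy environment variable.

What the proxy should do:

It takes HTTP requests from Scrapy and sends them to the server being crawled, then passes the response back to Scrapy; in other words, it intercepts all HTTP traffic.
For binary files (based on a heuristic you implement), it sends a 403 Forbidden error to Scrapy and immediately closes the request/response. This saves time and traffic, and Scrapy won't crash.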
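
The heuristic mentioned above is left to the reader. As a rough sketch in Python, one option is to look only at the media type in the Content-Type header. The function name `is_text_content` and the extra set `TEXT_LIKE_TYPES` are assumptions made for illustration; the answer's own proxy simply checks for a `text/` prefix.

```python
# Hypothetical helper: decide from a Content-Type header value whether a
# response should be treated as text. Not part of the original answer.
TEXT_LIKE_TYPES = {
    # Assumption: structured-text types that a crawler may also want to keep.
    "application/json",
    "application/xml",
    "application/xhtml+xml",
}


def is_text_content(content_type):
    """Return True if the Content-Type value names a text-like media type."""
    if not content_type:
        # A missing header is treated as "not text" in this sketch.
        return False
    # Drop parameters such as "; charset=utf-8" and normalise case.
    media_type = content_type.split(";", 1)[0].strip().lower()
    return media_type.startswith("text/") or media_type in TEXT_LIKE_TYPES
```

Whether a missing header should count as text or binary is a policy decision; this sketch rejects it, which is the stricter choice.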

Sample proxy code

It really works!

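var http = require('http');

// Forward each request from Scrapy to the target server, but refuse to relay
// any response whose Content-Type does not start with "text/".
http.createServer(function(clientReq, clientRes) {
    var options = {
        host: clientReq.headers['host'],
        port: 80,
        path: clientReq.url,
        method: clientReq.method,
        headers: clientReq.headers
    };

    var proxyReq = http.request(options, function(proxyRes) {
        var contentType = proxyRes.headers['content-type'] || '';
        if (!contentType.startsWith('text/')) {
            // Drop the upstream body and answer 403 instead of relaying it.
            proxyRes.destroy();
            var httpForbidden = 403;
            clientRes.writeHead(httpForbidden);
            clientRes.write('Binary download is disabled.');
            clientRes.end();
            // Stop here; otherwise the headers below are written a second time.
            return;
        }

        clientRes.writeHead(proxyRes.statusCode, proxyRes.headers);
        proxyRes.pipe(clientRes);
    });

    proxyReq.on('error', function(e) {
        console.log('problem with clientReq: ' + e.message);
        // Answer the client so that Scrapy is not left waiting for a response.
        if (!clientRes.headersSent) {
            clientRes.writeHead(502);
            clientRes.end();
        }
    });

    proxyReq.end();

}).listen(8080);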
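
On the Scrapy side, the proxy is picked up from the http_proxy environment variable, which Scrapy's built-in HttpProxyMiddleware reads through the standard library. A minimal sketch of setting it from Python before the crawl starts, assuming the proxy runs on the same machine on port 8080 (the address and the helper name `use_filtering_proxy` are illustrative, not from the answer):

```python
import os
import urllib.request

# Assumption: the Node.js proxy runs locally and listens on port 8080.
PROXY_URL = "http://127.0.0.1:8080"


def use_filtering_proxy(proxy_url=PROXY_URL):
    """Point plain-HTTP traffic at the proxy via the http_proxy variable.

    Returns the HTTP proxy that the standard library now reports, so the
    caller can confirm the setting took effect.
    """
    os.environ["http_proxy"] = proxy_url
    return urllib.request.getproxies().get("http")
```

This has to run before the crawler is created, because the proxy settings are read when the middleware is initialised. Exporting `http_proxy` in the shell before running `scrapy crawl` achieves the same thing. Note that the proxy shown here only forwards to port 80, so HTTPS requests are not covered.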

User

Reply 2 · Edited 2024-10-01 07:43:04

I built this middleware to exclude any response type that is not in a whitelist of regular expressions:

import logging
import re

from scrapy.exceptions import IgnoreRequest

logger = logging.getLogger(__name__)


class FilterResponses(object):
    """Limit the HTTP response types that Scrapy downloads."""

    @staticmethod
    def is_valid_response(type_whitelist, content_type_header):
        for type_regex in type_whitelist:
            if re.search(type_regex, content_type_header):
                return True
        return False

    def process_response(self, request, response, spider):
        """
        Only allow HTTP response types that match the given list of
        filtering regexes.
        """
        # Each spider must define the variable response_type_whitelist as an
        # iterable of regular expressions, e.g. (r'text', ).
        type_whitelist = getattr(spider, "response_type_whitelist", None)
        content_type_header = response.headers.get('content-type', None)
        # On Python 3, Scrapy returns header values as bytes; decode them so
        # that str patterns in the whitelist can match.
        if isinstance(content_type_header, bytes):
            content_type_header = content_type_header.decode('latin-1')
        if not type_whitelist:
            return response
        elif not content_type_header:
            logger.debug("no content type header: %s", response.url, extra={'spider': spider})
            raise IgnoreRequest()
        elif self.is_valid_response(type_whitelist, content_type_header):
            logger.debug("valid response %s", response.url, extra={'spider': spider})
            return response
        else:
            msg = "Ignoring request {}, content-type was not in whitelist".format(response.url)
            logger.debug(msg, extra={'spider': spider})
            raise IgnoreRequest()

To use it, register the class under DOWNLOADER_MIDDLEWARES in settings.py.
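
A minimal sketch of what such a settings.py entry could look like. The package name `myproject`, the module `middlewares`, and the priority `543` are placeholders chosen for illustration and do not come from the answer; the whitelist patterns are likewise only an example of what a spider might set as `response_type_whitelist`.

```python
import re

# settings.py (sketch): enable the downloader middleware.
# Assumption: the class lives in myproject/middlewares.py; 543 is an
# arbitrary priority, adjust it relative to your other middlewares.
DOWNLOADER_MIDDLEWARES = {
    "myproject.middlewares.FilterResponses": 543,
}

# Example value for a spider's response_type_whitelist attribute:
# anything text/*, plus JSON and XML.
response_type_whitelist = (r"^text/", r"^application/(json|xml)")


def matches_whitelist(whitelist, content_type):
    """Same check as FilterResponses.is_valid_response, shown standalone."""
    return any(re.search(pattern, content_type) for pattern in whitelist)
```

With these patterns, `text/html; charset=utf-8` and `application/json` pass, while `application/pdf` and `image/png` are rejected.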

User

Reply 3 · Edited 2024-10-01 07:43:04

This may be late, but you can use the Accept header to filter the data you are looking for.
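
As a sketch of this idea, Scrapy's `DEFAULT_REQUEST_HEADERS` setting can carry an Accept header that asks only for text formats. The exact media types and quality values below are an assumption; pick the ones your spider actually parses.

```python
# settings.py (sketch): ask servers for text formats only.
# Assumption: HTML, XHTML and plain text are the formats the spider needs.
DEFAULT_REQUEST_HEADERS = {
    "Accept": "text/html,application/xhtml+xml;q=0.9,text/plain;q=0.8",
}
```

The Accept header is only a request to the server. A server is free to ignore it and return a binary file anyway, so this works best alongside a check on the response's Content-Type.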
