<p>The solution is to set up a <code>Node.js</code> proxy and configure Scrapy to use it through the <code>http_proxy</code> environment variable.</p>
<p>What the <a href="http://en.wikipedia.org/wiki/Proxy_server" rel="nofollow">proxy</a> should do:</p>
<ul>
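<li>Take HTTP requests from Scrapy and send them to the server being crawled, then pass the response back to Scrapy. In other words, it intercepts all HTTP traffic.</li>
<li>For binary files (based on a heuristic you implement), send a <code>403 Forbidden</code> error to Scrapy and immediately close the request/response. This saves time and traffic, and Scrapy won't crash.</li>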
</ul>
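<p>The heuristic in the second point is up to you. As a minimal sketch, a check based only on response headers might look like the following. The helper name <code>looksBinary</code>, the <code>TEXTUAL_TYPES</code> allow-list and the optional <code>maxBytes</code> size cap are all hypothetical choices for illustration; adjust them to the sites you crawl.</p>

```javascript
// Hypothetical allow-list of non-text/* media types that are still textual (an assumption).
var TEXTUAL_TYPES = ['application/json', 'application/xml', 'application/xhtml+xml'];

// Hypothetical helper: guess from response headers whether the body is binary.
// `headers` uses lowercase keys, as Node's IncomingMessage.headers does.
// `maxBytes` is an optional size cap; larger responses are rejected regardless of type.
function looksBinary(headers, maxBytes) {
  // Drop parameters such as "; charset=utf-8" and normalize case.
  var contentType = (headers['content-type'] || '').split(';')[0].trim().toLowerCase();
  var length = parseInt(headers['content-length'], 10);

  if (maxBytes && !isNaN(length) && length > maxBytes) {
    return true;
  }
  if (contentType.indexOf('text/') === 0) {
    return false;
  }
  // A missing Content-Type falls through to here and is treated as binary.
  return TEXTUAL_TYPES.indexOf(contentType) === -1;
}
```

<p>Inside the proxy's response callback this would be called as <code>looksBinary(proxyRes.headers)</code>. Treating a missing <code>Content-Type</code> as binary is a conservative choice; relax it if the target servers omit that header on normal pages.</p>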
<h3>Sample proxy code</h3>
<p>It really works!</p>
<pre><code>var http = require('http');

http.createServer(function(clientReq, clientRes) {
  // Forward the incoming request to the host named in its Host header.
  // Note: port 80 is hard-coded, so only plain HTTP on the default port is handled.
  var options = {
    host: clientReq.headers['host'],
    port: 80,
    path: clientReq.url,
    method: clientReq.method,
    headers: clientReq.headers
  };
  var proxyReq = http.request(options, function(proxyRes) {
    var contentType = proxyRes.headers['content-type'] || '';
    // Heuristic: anything that is not text/* is treated as a binary download.
    if (!contentType.startsWith('text/')) {
      proxyRes.destroy();
      var httpForbidden = 403;
      clientRes.writeHead(httpForbidden);
      clientRes.write('Binary download is disabled.');
      clientRes.end();
      return; // stop here so the headers are not written a second time
    }
    // Text response: relay status, headers and body back to the client unchanged.
    clientRes.writeHead(proxyRes.statusCode, proxyRes.headers);
    proxyRes.pipe(clientRes);
  });
  proxyReq.on('error', function(e) {
    console.log('problem with clientReq: ' + e.message);
  });
  proxyReq.end();
}).listen(8080);
</code></pre>
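<p>To check the proxy by hand before pointing Scrapy at it, you can send a request through it from a small Node.js client. This is only a sketch: the helper name <code>proxiedOptions</code> is hypothetical, and <code>localhost:8080</code> assumes the proxy above is running on the same machine. It relies on the fact that a client talking to a forward proxy sends the absolute URL as the request target.</p>

```javascript
var http = require('http');

// Hypothetical helper: build http.request options that route a plain-HTTP URL
// through a forward proxy listening on proxyHost:proxyPort.
function proxiedOptions(targetUrl, proxyHost, proxyPort) {
  var target = new URL(targetUrl);
  return {
    host: proxyHost,                 // connect to the proxy, not the target
    port: proxyPort,
    method: 'GET',
    path: target.href,               // absolute-form request target, as proxies expect
    headers: { host: target.host }   // the proxy reads the target host from here
  };
}

// Usage (requires the proxy to be running):
// http.get(proxiedOptions('http://example.com/', 'localhost', 8080), function(res) {
//   console.log(res.statusCode, res.headers['content-type']);
//   res.resume();
// });
```

<p>A URL that serves an image or archive should come back as <code>403</code>, while an HTML page should pass through with its original status code.</p>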