碎片处理302重定向

2024-10-06 12:39:08 发布

您现在位置:Python中文网/ 问答频道 /正文

我试着用一个蹩脚的爬行蜘蛛来爬网一个网站问题是这个网站总是以随机模式重定向我,这意味着一个url有时可能会加载,有时会重定向到某个页面我尝试更改我的用户代理,试图通过创建一个类似于发送的http标头来模拟浏览器的行为通过浏览器,甚至当我使用crawlera发送请求时,也没有解决我的问题。如果有人能引导我渡过难关,我会很感激的

控制台:

2017-11-06 02:11:14 [scrapy.core.engine] INFO: Spider opened
2017-11-06 02:11:14 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-11-06 02:11:14 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2017-11-06 02:11:15 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (302) to <GET http://www.sears.com/en_intnl/dap/shopping-tourism.html> from <GET http://www.sears.com/en_intnl/dap/shopping-tourism.html>
2017-11-06 02:11:22 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.sears.com/en_intnl/dap/shopping-tourism.html> (referer: None)
2017-11-06 02:11:27 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (302) to <GET http://www.sears.com/en_us/botmanagement.html> from <GET http://www.sears.com/en_us/sitemap.html>
2017-11-06 02:11:34 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.sears.com/en_us/botmanagement.html> (referer: http://www.sears.com/en_intnl/dap/shopping-tourism.html)
2017-11-06 02:11:34 [scrapy.dupefilters] DEBUG: Filtered duplicate request: <GET http://www.sears.com/gifts/b-1020009> - no more duplicates will be shown (see DUPEFILTER_DEBUG to show all duplicates)
2017-11-06 02:11:39 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.sears.com/seasonal-christmas/b-1100100> (referer: http://www.sears.com/en_intnl/dap/shopping-tourism.html)
2017-11-06 02:11:46 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.sears.com/toys-games/b-1020010> (referer: http://www.sears.com/en_intnl/dap/shopping-tourism.html)
2017-11-06 02:11:52 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (302) to <GET http://www.sears.com/en_us/botmanagement.html> from <GET http://www.sears.com/home-decor-decorative-accents/b-1348893716>
2017-11-06 02:11:58 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (302) to <GET http://www.sears.com/en_us/botmanagement.html> from <GET http://www.sears.com/tvs-electronics-home-theater-audio-musical-instruments-guitars-string-instruments/b-5000861>
2017-11-06 02:12:03 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (302) to <GET http://www.sears.com/en_us/botmanagement.html> from <GET http://www.sears.com/tvs-electronics-gaming/b-1347529268>

Tags: todebugcomhttpgethtmlwwwen
1条回答
网友
1楼 · 发布于 2024-10-06 12:39:08

如果您没有很多代理:

  • 当您使用parse解析html时
    • 使用if response.url == """http://www.sears.com/en_us/botmanagement.html""":检测是否已重定向到reCAPCHA页面。在
    • 在Scrapy中使用Selenium(Selenium可以直接控制浏览器,因此您可以观察整个刮削过程并手动传递reCAPCHA)(This is an example of how to use selenium with Scrapy
    • 放慢你的刮削速度以防止蜘蛛被发现
  • 公共代理人

相关问题 更多 >