Crawled (200), but nothing scraped

Posted 2024-07-07 05:39:22


Hi again. I'm on the C10 plan and trying to scrape the Amazon website.

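My problem is that sometimes the log says a page was crawled, yet the spider does not scrape the data I want and simply moves on to the next page, as instructed. It scrapes some pages and not others, and I don't understand why. I checked the code and the HTML of the URL: there are items on the page to be scraped, and the log says it was crawled, but nothing was scraped. Can anyone help me understand what is going on? I was thinking the site might be returning a CAPTCHA, but even so, I thought Crawlera automatically retries requests that get a CAPTCHA.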

Here is the log:

'time': '2017-02-12',
'title': u'Basic GIS Coordinates, Second Edition',
'url': u'https://www.amazon.com/Basic-GIS-Coordinates-Second-Sickle/dp/1420092316/ref=sr_1_64?s=tradein-aps&srs=9187220011&ie=UTF8&qid=1486932384&sr=1-64'}
2017-02-12 14:46:31 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.amazon.com/s//s/ref=sr_nr_n_3/153-6246827-9833634?srs=9187220011&fst=as%3Aoff&rh=n%3A283155%2Cn%3A%211000%2Cn%3A173507%2Cn%3A173515%2Cn%3A227541%2Cn%3A13735&bbn=227541&ie=UTF8&qid=1486860051&rnid=227541> (referer: None)
2017-02-12 14:46:42 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.amazon.com/s//s/ref=sr_nr_n_2/153-6246827-9833634?srs=9187220011&fst=as%3Aoff&rh=n%3A283155%2Cn%3A%211000%2Cn%3A173507%2Cn%3A173515%2Cn%3A227541%2Cn%3A52187011&bbn=227541&ie=UTF8&qid=1486860051&rnid=227541> (referer: None)
2017-02-12 14:46:44 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.amazon.com/s/ref=sr_pg_2/153-6246827-9833634?bbn=227541&fst=as%3Aoff&ie=UTF8&page=2&qid=1486932385&rh=n%3A283155%2Cn%3A%211000%2Cn%3A173507%2Cn%3A173515%2Cn%3A227541%2Cn%3A13735&srs=9187220011> (referer: https://www.amazon.com/s//s/ref=sr_nr_n_3/153-6246827-9833634?srs=9187220011&fst=as%3Aoff&rh=n%3A283155%2Cn%3A%211000%2Cn%3A173507%2Cn%3A173515%2Cn%3A227541%2Cn%3A13735&bbn=227541&ie=UTF8&qid=1486860051&rnid=227541)
2017-02-12 14:46:44 [scrapy.log] DEBUG: successfully added!
2017-02-12 14:46:44 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.amazon.com/s/ref=sr_pg_2/153-6246827-9833634?bbn=227541&fst=as%3Aoff&ie=UTF8&page=2&qid=1486932385&rh=n%3A283155%2Cn%3A%211000%2Cn%3A173507%2Cn%3A173515%2Cn%3A227541%2Cn%3A13735&srs=9187220011>
{'currency': u'$',

1 answer
User
#1 · Posted 2024-07-07 05:39:22

When you crawl Amazon, my guess is that you are getting a CAPTCHA page instead of a normal product page.

Maybe you should print the response content instead of only returning items; then you can tell exactly which page was crawled.
