scrapy/selectorlib error when crawling Amazon

# -*- coding: utf-8 -*-
import scrapy
import os
import selectorlib


class AmazonSpider(scrapy.Spider):
    name = 'amazon'
    allowed_domains = ['www.amazon.it']
    start_urls = ['https://www.amazon.it/gp/goldbox']

    # Create Extractor for listing page
    listing_page_extractor = selectorlib.Extractor.from_yaml_file(
        os.path.join(os.path.dirname(__file__), '../selectorlib_yaml/urlSelector.yml'))
    # Create Extractor for product page
    product_page_extractor = selectorlib.Extractor.from_yaml_file(
        os.path.join(os.path.dirname(__file__), '../selectorlib_yaml/selector.yml'))

    def parse(self, response):
        # Extract data using Extractor
        data = self.listing_page_extractor.extract(response.text)
        if 'next' in data:
            # Printing for debug
            print(data['next'])
            yield scrapy.Request(data['next'], callback=self.parse)
        for p in data['product_page']:
            yield scrapy.Request(p, callback=self.parse_product)

    def parse_product(self, response):
        # Extract data using Extractor
        product = self.product_page_extractor.extract(response.text)
        if product:
            yield product

title:
    css: h1.a-size-large
    type: Text
category:
    css: 'div.a-subheader a.a-link-normal'
    multiple: true
    type: Text
price:
    css: 'td.a-span12 span.a-size-medium.a-color-price'
    type: Text
delivery:
    css: 'span.a-size-medium span.a-size-base'
    type: Text
fullprice:
    css: span.priceBlockStrikePriceString
    type: Text
discount:
    css: td.a-span12.a-color-price
    type: Text
availability:
    css: 'div.a-section div.feature div.a-section div.a-section span.a-size-medium'
    type: Text
time:
    css: 'td.a-span12 div.a-row span.a-size-base.a-color-base'
    type: Text
promotion:
    css: 'div.a-popover-content li'
    type: Text
stars:
    css: 'div.a-icon-row span.a-size-medium, div.a-popover-content span.a-size-base a.a-link-normal'
    multiple: true
    type: Text
votes:
    css: 'span.a-declarative a.a-link-normal span.a-size-base'
    multiple: true
    type: Text
ASIN:
    css: 'div.column.col2 tr:nth-of-type(1) td.value'
    type: Text
image:
    css: img.fullscreen
    multiple: true
    type: Attribute
    attribute: src
description:
    css: 'div.a-row div.a-section p:nth-of-type(1)'
    type: Text

2 Answers

User

Answer 1 · Edited 2024-09-27 23:17:21

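As @eLRuLL mentioned, the pagination div id is generated dynamically.

You have to use some driver that renders the page's JavaScript, such as a headless browser, or Splash, which Scrapy recommends.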

from scrapy_splash import SplashRequest
...

for p in data['product_page']:
    yield SplashRequest(p,
                        callback=self.parse_product,
                        args={'wait': 0.5},
                        endpoint='render.html')
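
SplashRequest only has an effect once scrapy-splash is enabled in the project settings and a Splash instance is reachable. The following is a sketch of the additions to settings.py, following the scrapy-splash README. The SPLASH_URL value is an assumption (Splash running locally on its default port 8050); adjust it to wherever your instance runs.

```python
# settings.py additions for scrapy-splash (sketch based on the project's README).

# Assumption: a Splash instance is listening locally on the default port.
SPLASH_URL = 'http://localhost:8050'

# Route requests through Splash and keep cookies and compression working.
DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}

# Avoid storing duplicate Splash arguments for each request.
SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}

# Make request de-duplication aware of Splash arguments.
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
```

With these in place, the `wait` argument gives the page's JavaScript time to run before Splash returns the rendered HTML to `parse_product`.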

With selectorlib you can use an XPath selector that matches any id containing "pagination-next".

https://selectorlib.readthedocs.io/en/latest/usage.html#xpath-default-blank

urlSelector.yml

product_page:
    css: a.a-size-base
    multiple: true
    type: Link
next:
    xpath: '//div[contains(@id, "pagination-next")]//li[@class="a-last"]/a/@href'
    type: Link
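
To make the intent of the `next` XPath concrete, here is a minimal standard-library sketch of the same matching rule: find an `a` that is a direct child of `li class="a-last"` somewhere inside a `div` whose id contains "pagination-next". It is not how selectorlib works internally. The class name `NextLinkParser`, the helper `find_next`, and the sample markup are all hypothetical; only the id pattern comes from the question.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class NextLinkParser(HTMLParser):
    """Collect the href of li.a-last > a inside div[id*="pagination-next"]."""

    def __init__(self):
        super().__init__()
        # Each entry: (tag, is_pagination_div, is_last_li)
        self.open_tags = []
        self.next_href = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        is_pagination = tag == 'div' and 'pagination-next' in (attrs.get('id') or '')
        is_last = tag == 'li' and attrs.get('class') == 'a-last'
        if tag == 'a' and self.next_href is None and self._parent_is_target_li():
            self.next_href = attrs.get('href')
        self.open_tags.append((tag, is_pagination, is_last))

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; tolerates unclosed void tags.
        for i in range(len(self.open_tags) - 1, -1, -1):
            if self.open_tags[i][0] == tag:
                del self.open_tags[i:]
                break

    def _parent_is_target_li(self):
        # The innermost open tag must be li.a-last, nested in a matching div.
        if not self.open_tags or not self.open_tags[-1][2]:
            return False
        return any(is_pag for _, is_pag, _ in self.open_tags[:-1])


def find_next(html, base_url):
    """Return the absolute next-page URL, or None if no link is present."""
    parser = NextLinkParser()
    parser.feed(html)
    return urljoin(base_url, parser.next_href) if parser.next_href else None


# Hypothetical rendered markup; the numeric suffix changes on every load.
SAMPLE = '''
<div id="pagination-next-30159412167606625">
  <ul class="a-pagination">
    <li class="a-normal"><a href="/gp/goldbox?page=1">1</a></li>
    <li class="a-last"><a href="/gp/goldbox?page=2">Next</a></li>
  </ul>
</div>
'''

print(find_next(SAMPLE, 'https://www.amazon.it/gp/goldbox'))
```

Because the match is on a substring of the id, the random numeric suffix does not matter. The link still has to exist in the HTML that reaches the extractor, which is why the page is rendered through Splash first.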

User

Answer 2 · Edited 2024-09-27 23:17:21

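This is a very interesting question. Amazon seems to be making its responses increasingly hard to parse, mostly because everyone wants the data it shares publicly.

Here are three things Amazon is doing that you need to understand in order to debug this: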

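The HTML structure is created dynamically, and the response body is not actually parseable HTML. (You could get those CSS selectors with the selectorlib Chrome extension because Chrome had already done all the work of rebuilding the response body with JavaScript. Remember that Scrapy is not a browser; it only works with the plain text of requests and responses.)
The pagination link identifier (in your case, div#pagination-next-30159412167606625) is also created dynamically; the number is randomly generated on load. You can verify this by reloading the site and checking that the number changes every time.
The pagination link itself is also generated dynamically. The link you are looking for is not in the element you are searching (the next-page element); it is created with JavaScript, which calls a JSON-parseable resource and repopulates the page.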

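Sorry that I can't provide actual code or more direction on how to solve your problem, but understanding the site and writing a working spider will take a lot of coding work.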
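
As a small debugging aid for the points above, and not a solution, the sketch below checks what the raw response text actually contains. The helper name `pagination_ids` and the sample bodies are hypothetical; the id format is taken from the question.

```python
import re

# Matches ids such as "pagination-next-30159412167606625".
PAGINATION_ID = re.compile(r'id="(pagination-next-\d+)"')


def pagination_ids(html):
    """Return every pagination-next id found in the raw response text."""
    return PAGINATION_ID.findall(html)


# Hypothetical response bodies, as if fetched on separate page loads.
first_load = '<div id="pagination-next-30159412167606625"></div>'
second_load = '<div id="pagination-next-81234567890123456"></div>'
no_pagination = '<div id="main"></div>'

print(pagination_ids(first_load))
print(pagination_ids(second_load))
# An empty list means the element is absent from the body Scrapy received.
print(pagination_ids(no_pagination))
```

Calling `pagination_ids(response.text)` inside `parse`, or from a `scrapy shell` session, shows whether the element reaches Scrapy at all and whether its id differs between loads.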
