Spider error processing URL

class GoodWillOutSpider(Spider):
    name = "GoodWillOutSpider"
    allowed_domains = ["thegoodwillout.com"]
    start_urls = [GoodWillOutURL]

    def __init__(self):
        logging.critical("GoodWillOut STARTED.")

    def parse(self, response):
        products = Selector(response).xpath('//div[@id="elasticsearch-results-container"]/ul[@class="product-list clearfix"]')
        for product in products:
            item = GoodWillOutItem()
            item['name'] = product.xpath('//div[@class="name ng-binding"]').extract()[0]
            item['link'] = "www.thegoodwillout.com" + product.xpath('//@href').extract()[0]
            # item['image'] = "http:" + product.xpath("/div[@class='catalogue-product-cover']/a[@class='catalogue-product-cover-image']/img/@src").extract()[0]
            # item['size'] = '**NOT SUPPORTED YET**'
            yield item

        yield Request(GoodWillOutURL, callback=self.parse, dont_filter=True, priority=16)

[scrapy.core.scraper] ERROR: Spider error processing <GET https://www.thegoodwillout.com/footwear> (referer: None)
  line 1085, in parse
    item['name'] = product.xpath('//div[@class="name ng-binding"]').extract()[0]
IndexError: list index out of range

2 Answers

User

#1 · Edited 2024-10-04 05:31:59

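The problem

If your scraper cannot access data that you can see with your browser's developer tools, it is not seeing the same data as your browser.

This can mean one of two things:

Your scraper is being recognized as a scraper and served different content
Some of the content is generated dynamically (usually by JavaScript)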

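Generic solution

The most straightforward way to solve both problems is to use an actual browser.

There are many headless browsers available, and you can pick whichever best fits your needs.
For Scrapy, scrapy-splash is probably the simplest option.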
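
As a rough sketch of what wiring in scrapy-splash might look like, the fragment below shows the project settings described in the scrapy-splash README. The `SPLASH_URL` value is an assumption: it presumes a Splash instance is already running locally on port 8050.

```python
# settings.py fragment for scrapy-splash (a sketch, not a full configuration).
# Assumption: a Splash instance is reachable at this address.
SPLASH_URL = "http://localhost:8050"

# Middlewares that route requests through Splash and handle its cookies.
DOWNLOADER_MIDDLEWARES = {
    "scrapy_splash.SplashCookiesMiddleware": 723,
    "scrapy_splash.SplashMiddleware": 725,
    "scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware": 810,
}

# Avoids storing duplicate Splash arguments for every request.
SPIDER_MIDDLEWARES = {
    "scrapy_splash.SplashDeduplicateArgsMiddleware": 100,
}

# Makes the duplicate filter aware of Splash request arguments.
DUPEFILTER_CLASS = "scrapy_splash.SplashAwareDupeFilter"
```

With settings like these in place, the spider would yield `SplashRequest(url, self.parse, args={'wait': 0.5})` (imported from `scrapy_splash`) in place of a plain `Request`, so that `parse` receives the rendered page. The `wait` value is only illustrative.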

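More specialized solution

Sometimes you can work out the reason for the different behavior and change your code accordingly.
This is usually the more efficient solution, but it may take more work on your part.

For example, if your scraper is being redirected, you may only need to use a different user agent string, pass some additional headers, or slow down your requests.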
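
The three adjustments mentioned above map onto standard Scrapy settings. The sketch below is one possible combination; the dictionary name `POLITE_SETTINGS` and every value in it are hypothetical and would need tuning for the target site.

```python
# Hypothetical values; none of them come from the original question.
POLITE_SETTINGS = {
    # Present a browser-like user agent instead of Scrapy's default one.
    "USER_AGENT": (
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/120.0 Safari/537.36"
    ),
    # Extra headers sent with every request.
    "DEFAULT_REQUEST_HEADERS": {
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
        "Accept-Language": "en",
    },
    # Wait (in seconds) between requests to the same site.
    "DOWNLOAD_DELAY": 2,
    # Let Scrapy adapt the delay to the server's response times.
    "AUTOTHROTTLE_ENABLED": True,
}
```

Assigning this dictionary to the spider's `custom_settings` class attribute would apply it to that spider only, leaving the rest of the project untouched.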

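If the content is generated by JavaScript, you can look at the page source (response.text, or "view source" in a browser) and figure out what is going on.

After that, there are two possibilities:

Extract the data in a different way (as gangabass did for your previous question)
Replicate what the JavaScript does in your spider code (for example by making additional requests, as in the current example)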
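
To illustrate the first possibility, here is a minimal sketch that pulls product data out of a JSON blob embedded in the page source instead of relying on JavaScript-rendered markup. The sample HTML, the `product-data` script id, and the function name are all hypothetical; the real page may embed its data differently, or fetch it through a separate request.

```python
import json
import re

# Hypothetical page source: the product list is empty until JavaScript runs,
# but the data itself is shipped inside a <script> tag.
SAMPLE_HTML = """
<html><body>
<div id="elasticsearch-results-container"></div>
<script type="application/json" id="product-data">
{"products": [{"name": "Example Shoe", "url": "/example-shoe"}]}
</script>
</body></html>
"""

def extract_embedded_products(html):
    """Return the embedded product list, or an empty list if none is found."""
    match = re.search(
        r'<script[^>]*id="product-data"[^>]*>(.*?)</script>',
        html,
        re.DOTALL,
    )
    if match is None:
        return []
    data = json.loads(match.group(1))
    return data.get("products", [])

for product in extract_embedded_products(SAMPLE_HTML):
    print(product["name"], product["url"])
```

Inside a spider, the same function could be called with `response.text`.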

User

#2 · Edited 2024-10-04 05:31:59

IndexError: list index out of range

You need to check whether the list contains any values after extraction before indexing into it:

item['name'] = product.xpath('//div[@class="name ng-binding"]').extract()
if item['name']:
    item['name'] = item['name'][0]
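
The same check can be wrapped in a small helper so that every field is handled uniformly. This is only a sketch; `first_or_default` is a hypothetical name, not part of Scrapy.

```python
def first_or_default(values, default=None):
    """Return the first element of values, or default when values is empty."""
    return values[0] if values else default

print(first_or_default(["Example Shoe"]))   # Example Shoe
print(first_or_default([], default="N/A"))  # N/A
```

Scrapy's selector lists also offer `extract_first()` (or `get()` in newer versions), which return `None` rather than raising `IndexError` when nothing matches. Note as well that an XPath starting with `//` searches the whole document even when called on `product`; `.//` keeps the search relative to the current node.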
