Scrapy selector not working on Splash response

# -*- coding: utf-8 -*-
import scrapy
from scrapy.selector import Selector
from scrapy_splash import SplashRequest


class CartierSpider(scrapy.Spider):
    name = 'cartier'
    start_urls = ['http://www.cartier.co.uk/en-gb/collections/watches/mens-watches/ballon-bleu-de-cartier/w69017z4-ballon-bleu-de-cartier-watch.html']

    def start_requests(self):
        for url in self.start_urls:
            yield SplashRequest(url, self.parse, args={'wait': 0.5})

    def parse(self, response):
        yield {
            'title': response.xpath('//title').extract(),
            'link': response.url,
            'productID': Selector(text=response.body).xpath('//span[@itemprop="productID"]/text()').extract(),
            'model': Selector(text=response.body).xpath('//span[@itemprop="model"]/text()').extract(),
            'price': Selector(text=response.body).css('div.price-wrapper').xpath('.//span[@itemprop="price"]/text()').extract(),
        }

3 answers

User

#1 · edited 2024-10-06 11:21:00

I tried SplashRequest and ran into the same problem as you. After messing around with it, I tried executing a Lua script instead.

script = """
function main(splash)
  local url = splash.args.url
  assert(splash:go(url))
  assert(splash:wait(0.5))
  return {
    html = splash:html(),
    png = splash:png(),
    har = splash:har(),
  }
end
"""

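Then make the request with the script as an argument. You can modify the script however you like. Test it out from the shell against localhost:9200, or whichever port you chose.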

[code block missing from the source]
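A minimal sketch of one way to try the script against Splash directly, assuming Splash is listening on localhost:9200 as mentioned above. It posts the script to Splash's /execute HTTP endpoint as the lua_source argument. The helper names build_execute_request and run_script are made up for illustration, and the script is trimmed to return only the HTML.

```python
import json
import urllib.request

# Trimmed version of the Lua script above: returns only the rendered HTML.
LUA_SCRIPT = """
function main(splash)
  local url = splash.args.url
  assert(splash:go(url))
  assert(splash:wait(0.5))
  return {html = splash:html()}
end
"""


def build_execute_request(splash_base, target_url, lua_source):
    """Build a JSON POST for Splash's /execute endpoint (nothing is sent yet)."""
    payload = {"lua_source": lua_source, "url": target_url}
    return urllib.request.Request(
        splash_base.rstrip("/") + "/execute",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def run_script(splash_base, target_url, lua_source, timeout=60):
    """Send the script to a running Splash and return the decoded JSON table."""
    request = build_execute_request(splash_base, target_url, lua_source)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Usage (needs a running Splash instance; the address is an assumption):
# result = run_script("http://localhost:9200", "http://example.com", LUA_SCRIPT)
# print(result["html"][:200])
```

From a spider, the same script can be passed along the lines of SplashRequest(url, self.parse, endpoint='execute', args={'lua_source': script}); the table the script returns should then be available as response.data.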

Oh, and by the way, the way you yield the data is odd; use Items instead.

User

#2 · edited 2024-10-06 11:21:00
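To illustrate the Items suggestion: a minimal sketch that declares the scraped fields up front instead of yielding a bare dict. The classic form is a scrapy.Item subclass with scrapy.Field() attributes. The version below uses a plain dataclass instead, an assumption on my part so that it runs without Scrapy installed (Scrapy 2.2 and later also accept dataclass objects as items). The class name WatchItem is hypothetical; the field names come from the question.

```python
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class WatchItem:
    """Hypothetical item holding the fields scraped in the question."""
    title: Optional[str] = None
    link: Optional[str] = None
    productID: Optional[str] = None
    model: Optional[str] = None
    price: Optional[str] = None


# In parse(), the dict literal would become something like:
#     yield WatchItem(title=..., link=response.url, productID=..., ...)
# A misspelled field name then fails immediately with a TypeError
# instead of silently producing a new dict key.
example = WatchItem(title="Example title", link="http://example.com")
```

Declaring the fields once also makes it easier for pipelines and exporters to rely on a fixed set of keys.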

Your spider works fine for me with Scrapy 1.1 and Splash 2.1, with no changes to the code in the question, using only the settings suggested at https://github.com/scrapy-plugins/scrapy-splash
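For reference, a sketch of the settings.py entries the scrapy-splash README recommends. This is written from memory, so check it against the README for your version. SPLASH_URL is an assumption (a local Splash on its default port 8050); point it at your own instance.

```python
# settings.py (sketch): scrapy-splash wiring as suggested by its README

# Assumption: Splash runs locally on its default port.
SPLASH_URL = 'http://localhost:8050'

DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    # Priority changed from Scrapy's default, as the README asks.
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}

SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}

# Splash-aware duplicate filter and HTTP cache storage.
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'
```

If selectors return nothing on a Splash response, it is worth confirming that these middlewares are actually enabled before changing the spider code.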

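As others have mentioned, your parse function can be simplified by using response.css() and response.xpath() directly, without rebuilding a Selector from the response.

I tried: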

import scrapy
from scrapy_splash import SplashRequest

class CartierSpider(scrapy.Spider):
  name = 'cartier'
  start_urls = ['http://www.cartier.co.uk/en-gb/collections/watches/mens-watches/ballon-bleu-de-cartier/w69017z4-ballon-bleu-de-cartier-watch.html']

  def start_requests(self):
    for url in self.start_urls:
      yield SplashRequest(url, self.parse, args={'wait': 0.5})

  def parse(self, response):
    yield {
      'title': response.xpath('//title/text()').extract_first(),
      'link': response.url,
      'productID': response.xpath('//span[@itemprop="productID"]/text()').extract_first(),
      'model': response.xpath('//span[@itemprop="model"]/text()').extract_first(),
      'price': response.css('div.price-wrapper').xpath('.//span[@itemprop="price"]/text()').extract_first(),
    }

And got this:

[output missing from the source]

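User

#3 · edited 2024-10-06 11:21:00

I don't have enough reputation to add a comment, so I have to answer here.

I ran into a similar problem with Splash 2.1 when I set 'Accept-Encoding': 'gzip' for the Splash request: it returned "malformed" HTML content (strictly speaking, gzip data that had not been decompressed).

In the end I found the solution here: https://github.com/scrapinghub/splash/pull/102. Change 'Accept-Encoding': 'gzip' to 'Accept-Encoding': 'deflate'.

I don't know why, but it works.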
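A sketch of what that header change can look like when talking to Splash's HTTP API directly: the headers argument of a JSON POST to /render.html sets the headers Splash sends on its first outgoing request. The Splash address (localhost:8050) and the helper name build_render_request are assumptions for illustration; where the header is set in a scrapy-splash project depends on how the request is built.

```python
import json
import urllib.request


def build_render_request(splash_base, target_url, wait=0.5):
    """Build a JSON POST for /render.html that asks the site for deflate."""
    payload = {
        "url": target_url,
        "wait": wait,
        # 'deflate' instead of 'gzip', as suggested in the answer above.
        "headers": {"Accept-Encoding": "deflate"},
    }
    return urllib.request.Request(
        splash_base.rstrip("/") + "/render.html",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Usage (needs a running Splash instance; the address is an assumption):
# request = build_render_request("http://localhost:8050", "http://example.com")
# html = urllib.request.urlopen(request, timeout=60).read().decode("utf-8")
```

Comparing the HTML returned with 'gzip' and with 'deflate' is a quick way to check whether compression is what breaks the selectors.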
