Scrapy: spider optimization

class GeneralSpider(CrawlSpider):
    name = 'domain'
    allowed_domains = ['domain.org']
    start_urls = ['http://www.domain.org/home']

    def parse(self, response):
        links = LinksItem()
        links['content'] = response.xpath("//div[@id='h45F23']").extract()
        return links

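Item spider:

Which is the best way to make the spider follow the pagination of a URL?

If the pagination is JQuery, meaning there is no GET variable in the URL, is it still possible to follow the pagination?

Can I have different "rules" in the same spider to scrape different parts of the page? Or is it better to have specialized spiders, each one focused on one thing?

I've also googled for any book related to Scrapy, but it seems there is no finished book yet, or at least I couldn't find one.

Does anyone know of a Scrapy book that will be released soon?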

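Edit:

These 2 URLs serve as the example. On the Eroski Home page you get the URLs of the product pages.

On the product page you have a paginated list of items (Eroski items):

URL to get the links: Eroski Home

URL to get the items: Eroski Fruits

On the Eroski Fruits page, the pagination of the items appears to be JQuery/AJAX, because more items are shown as you scroll down. Is there a way to get all of these items with Scrapy?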

1 answer

Answer 1 · posted 2024-10-04 01:32:02

Which is the best way to make the spider follow the pagination of a URL?

This is very site-specific and depends on how the pagination is implemented.

If the pagination is JQuery, meaning there is no GET variable in the URL, is it still possible to follow the pagination?

This is exactly your use case: the pagination is done through additional AJAX calls, and you can simulate those calls inside your Scrapy spider.

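Can I have different "rules" in the same spider to scrape different parts of the page? Or is it better to have specialized spiders, each one focused on one thing?

Yes, the "rules" mechanism that CrawlSpider provides is a very powerful technique. It is highly configurable: you can have multiple rules, some of which follow specific links that match specific criteria or that sit in a specific section of a page. A single spider with multiple rules should be preferred over multiple spiders.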

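Regarding your specific use case, here is the idea:

- Make a rule that follows the categories and subcategories in the navigation menu of the home page; this is where restrict_xpaths helps.
- In the callback, for every category or subcategory, yield a Request that mimics the AJAX request your browser sends when you open a category page.
- In the AJAX response handler (callback), parse the available items and yield another Request for the same category/subcategory, with the page GET parameter incremented (to get the next page).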
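
The control flow of the last step can be sketched without Scrapy at all. The snippet below assumes that an empty page marks the end of a category, which may not hold for every site; `fetch_page` and the fake data are hypothetical stand-ins for the AJAX request and its parsed response.

```python
def crawl_pages(fetch_page, first_page=1):
    """Yield items page by page until a page comes back empty."""
    page = first_page
    while True:
        items = fetch_page(page)
        if not items:
            # Assumption: an empty page means there is nothing left to fetch.
            break
        yield from items
        page += 1


# Hypothetical data standing in for two pages of AJAX responses.
fake_pages = {1: ['apple', 'pear'], 2: ['plum']}

print(list(crawl_pages(lambda page: fake_pages.get(page, []))))
# prints ['apple', 'pear', 'plum']
```

In a spider the loop becomes a chain of callbacks: each response handler yields its items and then one more Request for the next page, unless the page was empty.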

Example working implementation:

import re
from urllib.parse import urlencode

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

AJAX_URL = 'http://www.compraonline.grupoeroski.com/supermercado/ajax/listProducts.jsp?'


class ProductItem(scrapy.Item):
    description = scrapy.Field()
    price = scrapy.Field()


class GrupoeroskiSpider(CrawlSpider):
    name = 'grupoeroski'
    allowed_domains = ['compraonline.grupoeroski.com']
    start_urls = ['http://www.compraonline.grupoeroski.com/supermercado/home.jsp']

    # Follow only the links found in the navigation menu of the home page.
    rules = [
        Rule(LinkExtractor(restrict_xpaths='//div[@class="navmenu"]'), callback='parse_categories')
    ]

    def parse_categories(self, response):
        # Collect the numeric ids from URL segments of the form /<digits>-<name>.
        groups = re.findall(r'/(\d+)-\w+', response.url)
        if not groups:
            return  # no ids in the URL, so there is nothing to request

        params = {'page': 1, 'categoria': groups.pop(0)}
        if groups:
            params['grupo'] = groups.pop(0)
        if groups:
            params['familia'] = groups.pop(0)

        yield self.ajax_request(params)

    def parse_products(self, response):
        products = response.xpath('//div[@class="product_element"]')
        if not products:
            return  # assumption: an empty page means the category is exhausted

        for product in products:
            item = ProductItem()
            item['description'] = product.xpath('.//span[@class="description_1"]/text()').get()
            item['price'] = product.xpath('.//div[@class="precio_line"]/p/text()').get()
            yield item

        # Request the next page of the same category/subcategory.
        params = dict(response.meta['params'])
        params['page'] += 1
        yield self.ajax_request(params)

    def ajax_request(self, params):
        # Mimic the XMLHttpRequest the browser sends when it loads a product list.
        return scrapy.Request(AJAX_URL + urlencode(params),
                              meta={'params': params},
                              callback=self.parse_products,
                              headers={'X-Requested-With': 'XMLHttpRequest'})
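
To see what parse_categories derives from a category URL, the URL handling can be tried on its own, without Scrapy. This is only a sketch: the example URL and its ids (1111, 2222) are made up for illustration, and the helper names `build_params` and `ajax_url` are hypothetical.

```python
import re
from urllib.parse import urlencode

CATEGORY_ID = re.compile(r'/(\d+)-\w+')
AJAX_ENDPOINT = 'http://www.compraonline.grupoeroski.com/supermercado/ajax/listProducts.jsp'


def build_params(category_url, page=1):
    """Map the numeric ids found in the URL onto the AJAX query parameters."""
    ids = CATEGORY_ID.findall(category_url)
    params = {'page': page}
    # zip stops at the shorter sequence, so missing levels are simply left out.
    for key, value in zip(('categoria', 'grupo', 'familia'), ids):
        params[key] = value
    return params


def ajax_url(params):
    return AJAX_ENDPOINT + '?' + urlencode(params)


# Hypothetical category URL; the ids are invented.
example = 'http://www.compraonline.grupoeroski.com/supermercado/1111-Frescos/2222-Frutas/'
params = build_params(example)
print(ajax_url(params))
# the query string is: page=1&categoria=1111&grupo=2222
```

Getting the next page then only requires calling `build_params` with a larger `page` value, or incrementing `params['page']` before encoding it again.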

Hope that gives you a good starting point.

Does anyone know of a Scrapy book that will be released soon?

Nothing specific that I can recall.

I have heard that some publisher plans to release a book about web scraping, but I'm not supposed to tell you that.
