Scrapy rules not working in Python

Posted 2024-09-29 20:16:31

I can scrape the first page of craigslist, but the LinkExtractor does not fetch data from the other pages. Am I doing something wrong when defining the rules?

import scrapy
from craiglist.items import craiglistItem
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class ExampleSpider(CrawlSpider):
    name = "craiglist"
    allowed_domains = ["craiglist.org"]
    start_urls = (
        'http://sfbay.craigslist.org/search/npo',
    )
    rules = [
        Rule(LinkExtractor(restrict_xpaths='//a[@class="button next"]'), callback='parse', follow=True)
    ]

    def parse(self, response):
        titles = response.selector.xpath('//*[@id="sortable-results"]/ul/li/p')
        items = []
        for title in titles:
            item = craiglistItem()
            item["title"] = title.select("a/text()").extract()
            item["link"] = title.select("a/@href").extract()
            items.append(item)
        return items

1 answer
User
#1 · Posted 2024-09-29 20:16:31

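I modified the code and it now works. Compared with the question, allowed_domains is now spelled "craigslist.org" so it matches the start URL, and the rule's callback is called parse_items instead of parse, because CrawlSpider uses parse itself to apply the rules. The first page is handled through parse_start_url. Here is the working code.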

import scrapy
from craiglist.items import craiglistItem
# scrapy.contrib.spiders is the old, deprecated import path
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class ExampleSpider(CrawlSpider):
    name = "craiglist"
    # Must match the real host, otherwise followed links are dropped as off-site
    allowed_domains = ["craigslist.org"]
    start_urls = (
        'http://sfbay.craigslist.org/search/npo',
    )
    rules = [
        # Do not name the callback "parse": CrawlSpider uses parse() to apply its rules
        Rule(LinkExtractor(restrict_xpaths='//a[@class="button next"]'), callback="parse_items", follow=True),
    ]

    def parse_start_url(self, response):
        # Rule callbacks only run on followed links, so handle the first page here
        return self.parse_items(response)

    def parse_items(self, response):
        titles = response.selector.xpath('//*[@id="sortable-results"]/ul/li/p')
        items = []
        for title in titles:
            item = craiglistItem()
            # .xpath() replaces the deprecated .select()
            item["title"] = title.xpath("a/text()").extract()
            item["link"] = title.xpath("a/@href").extract()
            #item["link"] = response.url
            items.append(item)
        return items
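
One difference between the two spiders that is easy to miss is the spelling in allowed_domains: the question has "craiglist.org", the answer has "craigslist.org". Scrapy drops followed links whose host is not covered by allowed_domains, while the start URLs themselves are still fetched, which would explain why only the first page was scraped. The snippet below is only a rough stdlib sketch of that kind of host check, not Scrapy's actual code; the helper name is_allowed and the next-page URL are made up for illustration.

```python
from urllib.parse import urlparse


def is_allowed(url, allowed_domains):
    """Return True if the URL's host equals an allowed domain or is a subdomain of one.

    Simplified stand-in for an off-site check (hypothetical helper, not Scrapy's API).
    """
    host = (urlparse(url).hostname or "").lower()
    return any(host == d or host.endswith("." + d) for d in allowed_domains)


# Hypothetical URL of a "next" page on the same site
next_page = "http://sfbay.craigslist.org/search/npo?s=100"

# Spelling from the question (missing an "s"): the link looks off-site
print(is_allowed(next_page, ["craiglist.org"]))

# Spelling from the answer: the link is covered by the allowed domain
print(is_allowed(next_page, ["craigslist.org"]))
```

If the real spider is run with the misspelled domain, the log should mention filtered off-site requests, which is a quick way to confirm this is the cause.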
