Scrapy: unclear whether CrawlSpider is following pagination correctly

Posted 2024-09-26 18:15:02

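I am using Scrapy to scrape housing ads, following the example here.

In my case, I follow links to housing ad pages rather than author pages, and then scrape each ad page for information.

My spider correctly extracts the information from the ads on one page and follows the pagination link to the next page.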

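Problem

The example uses only one start URL, but I want to "start" from a large number of URLs, all of which are listed on a single page. To obtain these URLs, I switched to a CrawlSpider.

The CrawlSpider obtains the URLs, but it (seemingly) randomly alternates between following pagination links and scraping housing ads.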
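One possible reading of the seemingly random order: Scrapy downloads requests concurrently and runs each callback when its response arrives, so the order in which pages are handled need not match the order in which the requests were yielded. The toy model below is only a sketch of that idea using the standard library; it is not Scrapy's actual scheduler, and the site layout, URL names, and download times are all made-up assumptions.

```python
import heapq
from itertools import count

# Hypothetical site: two start categories, two listing pages each,
# two ads per listing page. None of these names come from the real site.
SITE = {
    "A/page1": {"ads": ["A/ad1", "A/ad2"], "next": "A/page2"},
    "A/page2": {"ads": ["A/ad3", "A/ad4"], "next": None},
    "B/page1": {"ads": ["B/ad1", "B/ad2"], "next": "B/page2"},
    "B/page2": {"ads": ["B/ad3", "B/ad4"], "next": None},
}
# Assumed download times in arbitrary units.
LATENCY = {"listing": 2, "ad": 3}


def crawl(start_urls):
    """Return (kind, url) pairs in the order the responses are handled."""
    pending = []      # min-heap ordered by completion time, then issue order
    ticket = count()  # tie-breaker so equal completion times stay stable

    def request(now, kind, url):
        heapq.heappush(pending, (now + LATENCY[kind], next(ticket), kind, url))

    for url in start_urls:
        request(0, "listing", url)

    handled = []
    while pending:
        now, _, kind, url = heapq.heappop(pending)
        handled.append((kind, url))
        if kind == "listing":
            # Same shape as parse_page: ad requests first, then pagination.
            for ad in SITE[url]["ads"]:
                request(now, "ad", ad)
            if SITE[url]["next"] is not None:
                request(now, "listing", SITE[url]["next"])
    return handled


order = crawl(["A/page1", "B/page1"])
for kind, url in order:
    print(kind, url)
```

In this model every ad is still reached and each category's pages are still visited in sequence, but the handling of listing pages and ads from the two categories is mixed together. A log that looks shuffled is therefore not, on its own, evidence that pagination is being followed incorrectly.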

Code so far

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class RoomsSpider(CrawlSpider):
    name = 'rooms'
    allowed_domains = ['spareroom.co.uk']
    start_urls = ['https://www.spareroom.co.uk/flatshare/london']

    # rules obtains the 'starting' urls
    rules = [
        Rule(
            LinkExtractor(restrict_xpaths=(
                '//*[@id="spareroom"]/div[2]/aside[2]/ul/li/a',
            )),
            callback='parse_page',
        ),
    ]

    def parse_page(self, response):
        # follow links to ad pages
        for href in response.xpath(
                '//*[@id="maincontent"]/ul/li/article/header[1]'
        ).css('a::attr(href)').extract():
            yield scrapy.Request(response.urljoin(href),
                                 callback=self.parse_ad)

        # follow pagination links
        next_page = response.xpath(
            '//*[@id="maincontent"]/div[2]/ul[2]/li/strong/a/@href'
        ).extract_first()
        if next_page is not None:
            next_page = response.urljoin(next_page)
            yield scrapy.Request(next_page, callback=self.parse_page)

    def parse_ad(self, response):
        # code extracting ad information follows here,
        # ending with a yield of the scraped data.
        ...

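Ideally, for each start URL the spider would follow its ad links, scrape its ads, and follow its pagination links until there are no more pagination links.

Perhaps I should add a Rule to obtain the pagination links and eliminate the {} part? I don't know how to proceed.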
