Is the official small example broken?

import scrapy

class MySpider(scrapy.Spider):
    name = 'huffingtonpost'
    allowed_domains = ['huffingtonpost.com/']
    start_urls = [
        'http://www.huffingtonpost.com/politics/',
        'http://www.huffingtonpost.com/entertainment/',
        'http://www.huffingtonpost.com/media/',
    ]

    def parse(self, response):
        for h3 in response.xpath('//h3').extract():
            yield {"title": h3}

        for url in response.xpath('//a/@href').extract():
            yield scrapy.Request(url, callback=self.parse)

1 answer

User

#1 · Posted 2024-06-28 16:40:36

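Some of the extracted links are relative (for example, /news/hillary-clinton/), and scrapy.Request needs a full URL including the scheme. You should convert them to absolute URLs (http://www.huffingtonpost.com/news/hillary-clinton/):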

import scrapy

class MySpider(scrapy.Spider):
    name = 'huffingtonpost'
    allowed_domains = ['huffingtonpost.com']  # domain name only, no trailing slash
    start_urls = [
        'http://www.huffingtonpost.com/politics/',
        'http://www.huffingtonpost.com/entertainment/',
        'http://www.huffingtonpost.com/media/',
    ]

    def parse(self, response):
        for h3 in response.xpath('//h3').extract():
            yield {"title": h3}

        for url in response.xpath('//a/@href').extract():
            if url.startswith('/'):
                # root-relative link: prepend the site's scheme and host
                url = 'http://www.huffingtonpost.com' + url
            if url.startswith('#'):
                # skip fragment-only links such as '#top'
                continue
            yield scrapy.Request(url, callback=self.parse)
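
The prefix checks above only cover hrefs that start with '/' or '#'. As a rough sketch of a more general approach, the standard library's urllib.parse.urljoin can resolve any relative link against the page URL. The helper name absolutize and the sample links below are made up for illustration; they are not part of Scrapy or of the original example.

```python
from urllib.parse import urljoin


def absolutize(base_url, hrefs):
    """Resolve each href against base_url, skipping links that are not pages."""
    result = []
    for href in hrefs:
        href = href.strip()
        # Skip empty links, in-page anchors, and non-HTTP schemes.
        if not href or href.startswith(('#', 'mailto:', 'javascript:')):
            continue
        # urljoin leaves absolute URLs unchanged and resolves relative ones.
        result.append(urljoin(base_url, href))
    return result


if __name__ == '__main__':
    # Hypothetical sample data, chosen to cover each kind of link.
    page = 'http://www.huffingtonpost.com/politics/'
    links = ['/news/hillary-clinton/', '#top', 'archive/', 'http://example.com/a']
    for url in absolutize(page, links):
        print(url)
```

Inside a spider, response.urljoin(url) performs the same resolution against the response's own URL, so the manual string concatenation is probably not needed there.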
