Parsing HTML string fragments of a web page using CSS attribute selectors

import scrapy


class PdgaSpider(scrapy.Spider):
    name = "pdgavideos"  # Name of the Spider, required value
    start_urls = ["http://www.pdga.com/videos/"]  # Entry point for the spiders

    def parse(self, response):
        SET_SELECTOR = 'tbody'
        for brickset in response.css(SET_SELECTOR):
            HTML_SELECTOR = 'td.views-field.views-field-title a ::attr(href)'
            yield {
                'http://www.pdga.com': brickset.css(HTML_SELECTOR).extract()[0]
            }

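2 answers

User

Answer 1 · edited 2024-10-03 06:28:53

Your code yields a dictionary that uses the site's base URL as the key and the relative href as the value, which is why the link looks broken: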

{'http://www.pdga.com': u'/videos/2017-glass-blown-open-fpo-rd-2-pt-2-pierce-fajkus-leatherman-c-allen-sexton-leatherman'}

What you can do is build the dictionary like this instead:

yield {
    'href_link': 'http://www.pdga.com' + brickset.css(HTML_SELECTOR).extract()[0]
}

This gives you a new dict whose value is the complete, unbroken href:

{'href_link': u'http://www.pdga.com/videos/2017-glass-blown-open-fpo-rd-2-pt-2-pierce-fajkus-leatherman-c-allen-sexton-leatherman'}

Note: a spider callback must return Request, BaseItem, dict, or None; see the documentation of the parse function for reference.
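
The fix described above can be tried in plain Python, without running a spider. This is only a sketch: the helper name `make_item` is hypothetical, and the standard library's `urljoin` stands in for joining the strings by hand. The base URL and the relative href are the ones from the output shown above.

```python
from urllib.parse import urljoin

BASE_URL = "http://www.pdga.com"


def make_item(relative_href, base_url=BASE_URL):
    """Return a dict with a fixed key and the absolute URL as its value."""
    # urljoin leaves already-absolute URLs untouched and resolves
    # relative ones against base_url.
    return {"href_link": urljoin(base_url, relative_href)}


item = make_item(
    "/videos/2017-glass-blown-open-fpo-rd-2-pt-2-pierce-fajkus-"
    "leatherman-c-allen-sexton-leatherman"
)
print(item["href_link"])
```

Keeping the key fixed (`href_link`) and putting the whole URL in the value is what lets the exported items line up under a single column.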

User

Answer 2 · edited 2024-10-03 06:28:53

To get an absolute URL from a relative link, you can use Scrapy's urljoin() method and rewrite the code as follows:

import scrapy

class PdgaSpider(scrapy.Spider):
    name = "pdgavideos"
    start_urls = ["http://www.pdga.com/videos/"]

    def parse(self, response):
        for link in response.xpath('//td[2]/a/@href').extract():
            yield scrapy.Request(response.urljoin(link), callback=self.parse_page)

        # If page contains link to next page extract link and parse
        next_page = response.xpath('//a[contains(., "next")]/@href').extract_first()
        if next_page:
            yield scrapy.Request(response.urljoin(next_page), callback=self.parse)

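    def parse_page(self, response):
        # The video is embedded in an iframe; skip pages that have none
        link = response.xpath('//iframe/@src').extract_first()
        if link:
            yield {
                # Prepend the scheme and drop the query string
                'you_tube_link': 'http:' + link.split('?')[0]
            }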

# To save the links in CSV format, run in the console: scrapy crawl pdgavideos -o links.csv
# http://www.youtube.com/embed/tYBF-BaqVJ8
# http://www.youtube.com/embed/_H0hBBc1Azg
# http://www.youtube.com/embed/HRbKFRCqCos
# http://www.youtube.com/embed/yz3D1sXQkKk
# http://www.youtube.com/embed/W7kuKe2aQ_c
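
The string handling in parse_page can also be checked on its own. The following is a minimal sketch, not part of the spider: the helper name `clean_embed_url` is hypothetical, it assumes the iframe src is either protocol-relative (starting with `//`) or already absolute, and the `?rel=0` query string in the example is made up for illustration.

```python
from urllib.parse import urljoin, urlsplit, urlunsplit


def clean_embed_url(src, page_url="http://www.pdga.com/videos/"):
    """Resolve an iframe src against the page URL and drop query/fragment."""
    if not src:
        # No iframe was found on the page
        return None
    # A '//host/path' src inherits the scheme of page_url
    absolute = urljoin(page_url, src)
    parts = urlsplit(absolute)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))


print(clean_embed_url("//www.youtube.com/embed/tYBF-BaqVJ8?rel=0"))
```

Compared with prepending 'http:' by hand, resolving against the page URL also copes with a src that already carries its own scheme.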
