Using Scrapy to parse a table page and extract data from the underlying links

Published 2024-05-19 08:37:59


I am trying to scrape the underlying data in the table on the following page: https://www.un.org/sc/suborg/en/sanctions/1267/aq_sanctions_list/summaries

What I want to do is visit the underlying link for each row and capture:

  1. The ID tag (e.g. QDE001)
  2. The name
  3. The reason for listing / additional information
  4. Other linked entities

This is what I have, but it doesn't seem to work; I keep getting NotImplementedError('{}.parse callback is not defined'.format(self.__class__.__name__)). I believe the XPaths I defined are fine, so I'm not sure what I'm missing.

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class UNSCItem(scrapy.Item):
    name = scrapy.Field()
    uid = scrapy.Field()
    link = scrapy.Field()
    reason = scrapy.Field()
    add_info = scrapy.Field()



class UNSC(scrapy.Spider):
    name = "UNSC"
    start_urls = [
        'https://www.un.org/sc/suborg/en/sanctions/1267/aq_sanctions_list/summaries?type=All&page=0',      
        'https://www.un.org/sc/suborg/en/sanctions/1267/aq_sanctions_list/summaries?type=All&page=1',
        'https://www.un.org/sc/suborg/en/sanctions/1267/aq_sanctions_list/summaries?type=All&page=2',
        'https://www.un.org/sc/suborg/en/sanctions/1267/aq_sanctions_list/summaries?type=All&page=3',
        'https://www.un.org/sc/suborg/en/sanctions/1267/aq_sanctions_list/summaries?type=All&page=4',
        'https://www.un.org/sc/suborg/en/sanctions/1267/aq_sanctions_list/summaries?type=All&page=5',
        'https://www.un.org/sc/suborg/en/sanctions/1267/aq_sanctions_list/summaries?type=All&page=6',]

    rules = Rule(LinkExtractor(allow=('/sc/suborg/en/sanctions/1267/aq_sanctions_list/summaries/',)), callback='data_extract')


    def data_extract(self, response):
        item = UNSCItem()
        name = response.xpath('//*[@id="content"]/article/div[3]/div//text()').extract()
        uid = response.xpath('//*[@id="content"]/article/div[2]/div/div//text()').extract()
        reason =  response.xpath('//*[@id="content"]/article/div[6]/div[2]/div//text()').extract() 
        add_info = response.xpath('//*[@id="content"]/article/div[7]//text()').extract()
        related = response.xpath('//*[@id="content"]/article/div[8]/div[2]//text()').extract()
        yield item

1 Answer

Posted 2024-05-19 08:37:59

Try the approach below. It should fetch all the ids and the corresponding names from all seven pages. I suppose you can manage the remaining fields yourself. As for the error: `rules` are only processed by CrawlSpider; with plain scrapy.Spider they are ignored, so your data_extract callback was never attached and Scrapy fell back to the default parse method, which raises NotImplementedError when it is not defined.

Run it as is:

import scrapy

class UNSC(scrapy.Spider):
    name = "UNSC"

    # Build the seven listing pages (page=0 .. page=6) in one go.
    start_urls = [
        'https://www.un.org/sc/suborg/en/sanctions/1267/aq_sanctions_list/summaries?type=All&page={}'.format(page)
        for page in range(0, 7)
    ]

    def parse(self, response):
        # Each sanction entry is one row of the "views-table" table.
        for item in response.xpath('//*[contains(@class,"views-table")]//tbody//tr'):
            # The last text node holds the value; strip surrounding whitespace.
            idnum = item.xpath('.//*[contains(@class,"views-field-field-reference-number")]/text()').extract()[-1].strip()
            name = item.xpath('.//*[contains(@class,"views-field-title")]//span[@dir="ltr"]/text()').extract()[-1].strip()
            yield {'ID': idnum, 'Name': name}
