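I am trying to scrape the underlying data from the table on the following page: https://www.un.org/sc/suborg/en/sanctions/1267/aq_sanctions_list/summaries
What I want to do is visit the link behind each row and capture:
This is what I have so far, but it does not seem to work. I keep getting NotImplementedError('{}.parse callback is not defined'.format(self.__class__.__name__)). I believe the XPaths I defined are fine, so I am not sure what I am missing.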
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class UNSCItem(scrapy.Item):
    name = scrapy.Field()
    uid = scrapy.Field()
    link = scrapy.Field()
    reason = scrapy.Field()
    add_info = scrapy.Field()


class UNSC(scrapy.Spider):
    name = "UNSC"
    start_urls = [
        'https://www.un.org/sc/suborg/en/sanctions/1267/aq_sanctions_list/summaries?type=All&page=0',
        'https://www.un.org/sc/suborg/en/sanctions/1267/aq_sanctions_list/summaries?type=All&page=1',
        'https://www.un.org/sc/suborg/en/sanctions/1267/aq_sanctions_list/summaries?type=All&page=2',
        'https://www.un.org/sc/suborg/en/sanctions/1267/aq_sanctions_list/summaries?type=All&page=3',
        'https://www.un.org/sc/suborg/en/sanctions/1267/aq_sanctions_list/summaries?type=All&page=4',
        'https://www.un.org/sc/suborg/en/sanctions/1267/aq_sanctions_list/summaries?type=All&page=5',
        'https://www.un.org/sc/suborg/en/sanctions/1267/aq_sanctions_list/summaries?type=All&page=6',
    ]
    rules = Rule(LinkExtractor(allow=('/sc/suborg/en/sanctions/1267/aq_sanctions_list/summaries/',)), callback='data_extract')

    def data_extract(self, response):
        item = UNSCItem()
        name = response.xpath('//*[@id="content"]/article/div[3]/div//text()').extract()
        uid = response.xpath('//*[@id="content"]/article/div[2]/div/div//text()').extract()
        reason = response.xpath('//*[@id="content"]/article/div[6]/div[2]/div//text()').extract()
        add_info = response.xpath('//*[@id="content"]/article/div[7]//text()').extract()
        related = response.xpath('//*[@id="content"]/article/div[8]/div[2]//text()').extract()
        yield item
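Try the approach below. It should fetch all the ids and their corresponding names from all six pages. I suppose you can manage the remaining fields yourself. Run it as is: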