I'm having a problem with my web crawler: I want to save the data it fetches. If I understood the Scrapy tutorial correctly, I just need to yield the data and then start the spider with scrapy crawl <crawler> -o file.csv -t csv, right? For some reason the file stays empty. Here is my code:
# -*- coding: utf-8 -*-
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class PaginebiancheSpider(CrawlSpider):
    name = 'paginebianche'
    allowed_domains = ['paginebianche.it']
    start_urls = ['https://www.paginebianche.it/aziende-clienti/lombardia/milano/comuni.htm']

    rules = (
        Rule(LinkExtractor(allow=(), restrict_css=('.seo-list-name', '.seo-list-name-up')),
             callback="parse_item",
             follow=True),
    )

    def parse_item(self, response):
        if(response.xpath("//h2[@class='rgs']//strong//text()") != [] and response.xpath("//span[@class='value'][@itemprop='telephone']//text()") != []):
            yield ' '.join(response.xpath("//h2[@class='rgs']//strong//text()").extract()) + " " + response.xpath("//span[@class='value'][@itemprop='telephone']//text()").extract()[0].strip(),
I am using Python 2.7.
If you look at the spider's output, you will see a bunch of error messages like the ones below. They mean you are not yielding the right kind of object: you need a dict or an Item, not the single-element tuple you are producing. Something as simple as this should do it:
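As a minimal sketch, parse_item can yield a dict built from the two XPath extractions; the field names name and phone are my choice, not something fixed by Scrapy or the question:

```python
# Sketch of a corrected parse_item: yield a dict, not a tuple.
# The field names 'name' and 'phone' are assumptions; pick any you like,
# they become the CSV column headers when exporting with -o file.csv.
def parse_item(self, response):
    names = response.xpath("//h2[@class='rgs']//strong//text()").extract()
    phones = response.xpath(
        "//span[@class='value'][@itemprop='telephone']//text()").extract()
    if names and phones:
        # A dict (or a scrapy.Item) is what the feed exporter expects.
        yield {
            'name': ' '.join(names),
            'phone': phones[0].strip(),
        }
```

Note the trailing comma at the end of your original yield line: in Python, expr, is a one-element tuple, which is why the exporter rejected each item and the file stayed empty.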