I want to scrape xkcd.com and retrieve all of the available images. When I run my scraper, it only downloads 7-8 seemingly random images out of the range www.xkcd.com/1-1461. I want it to go through every page in sequence and save each image, so that I end up with the complete set.
Below is the spider I wrote for the crawl, along with the output I received from scrapy:
Spider:
import scrapy
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors import LinkExtractor
from xkcd.items import XkcdItem


class XkcdimagesSpider(CrawlSpider):
    name = "xkcdimages"
    allowed_domains = ["xkcd.com"]
    start_urls = ['http://www.xkcd.com']
    rules = [Rule(LinkExtractor(allow=['\d+']), 'parse_xkcd')]

    def parse_xkcd(self, response):
        image = XkcdItem()
        image['title'] = response.xpath(
            "//div[@id='ctitle']/text()").extract()
        image['image_urls'] = response.xpath(
            "//div[@id='comic']/img/@src").extract()
        return image
Output:
(scrapy log omitted)
You need to set the follow parameter to True in your crawl rule. When a Rule is given a callback, follow defaults to False, so scrapy only visits the links found on the start page and never crawls onward from them; that is why you end up with just a handful of random comics. Try something like this:
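A sketch of the same spider with follow enabled; everything except the rules line is the question's own code:

import scrapy
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors import LinkExtractor
from xkcd.items import XkcdItem


class XkcdimagesSpider(CrawlSpider):
    name = "xkcdimages"
    allowed_domains = ["xkcd.com"]
    start_urls = ['http://www.xkcd.com']
    # follow=True makes CrawlSpider keep extracting links from every
    # matched page instead of stopping after the start page
    rules = [Rule(LinkExtractor(allow=[r'\d+']), callback='parse_xkcd',
                  follow=True)]

    def parse_xkcd(self, response):
        image = XkcdItem()
        image['title'] = response.xpath(
            "//div[@id='ctitle']/text()").extract()
        image['image_urls'] = response.xpath(
            "//div[@id='comic']/img/@src").extract()
        return image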
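For actually saving the files, the image_urls field suggests you are using the images pipeline; a minimal settings.py sketch under that assumption (the module path matches the scrapy.contrib imports used above, and the store path is a placeholder):

# settings.py
ITEM_PIPELINES = {'scrapy.contrib.pipeline.images.ImagesPipeline': 1}
IMAGES_STORE = '/path/to/your/images'  # placeholder: pick a real directory

Note that XkcdItem then needs an images field alongside image_urls, since that is where the pipeline records the downloaded files by default.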