I'm crawling a website with Scrapy (0.22). I need to do three things:
But right now I'm stuck: I'm using a pipeline to download the images, but my code doesn't work; it never saves the images locally.
Also, since I want to store the information in MongoDB, could anyone give me some advice on the Mongo document structure?
My code is as follows:
settings.py
BOT_NAME = 'tutorial'
SPIDER_MODULES = ['tutorial.spiders']
NEWSPIDER_MODULE = 'tutorial.spiders'
ITEM_PIPELINES = {'tutorial.pipelines.TutorialPipeline': 1}
IMAGES_STORE = '/ttt'
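As a point of comparison, if no custom download behavior is needed, the stock pipeline can be enabled directly in settings.py instead of a subclass. This is a sketch, not tested against the asker's project; the `/ttt` path is kept from the original and must be an existing, writable directory:

```python
# settings.py sketch (assumption: no custom pipeline behavior is required).
# In Scrapy 0.22 the images pipeline lives under scrapy.contrib.
BOT_NAME = 'tutorial'
SPIDER_MODULES = ['tutorial.spiders']
NEWSPIDER_MODULE = 'tutorial.spiders'

# Enable the built-in ImagesPipeline directly, with priority 1.
ITEM_PIPELINES = {'scrapy.contrib.pipeline.images.ImagesPipeline': 1}
IMAGES_STORE = '/ttt'  # must exist and be writable, or downloads silently fail
```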
items.py
^{pr2}$
pipelines.py
from scrapy.contrib.pipeline.images import ImagesPipeline
from scrapy.exceptions import DropItem
from scrapy.http import Request
from pprint import pprint as pp

class TutorialPipeline(object):
    def get_media_requests(self, item, info):
        pp('**********************===================*******************')
        for image_url in item['image_urls']:
            yield Request(image_url)
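For reference, the other half of a custom images pipeline is `item_completed(results, item, info)`, where `results` is a list of `(success, info)` two-tuples; for a success, `info` is a dict with `url`, `path`, and `checksum` keys. The filtering step that the usual override performs can be sketched framework-free like this (the function name is mine, not from the original code):

```python
def successful_image_paths(results):
    """Extract stored file paths from ImagesPipeline-style results.

    `results` is a list of (success, info) tuples: for successes,
    `info` is a dict with 'url', 'path', and 'checksum' keys;
    for failures it is the exception that occurred.
    """
    return [info['path'] for success, info in results if success]

demo = [
    (True, {'url': 'http://x/a.gif', 'path': 'full/a.gif', 'checksum': 'abc'}),
    (False, Exception('download failed')),
]
assert successful_image_paths(demo) == ['full/a.gif']
```

Inside a real `item_completed()` override, an empty result list is where the item is typically dropped with `DropItem`.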
spider.py
import os
import scrapy
from pprint import pprint as pp
from scrapy import log
from scrapy.http import Request
from scrapy.selector import Selector
from scrapy.spider import Spider
from tutorial.items import TutorialItem

class BaiduSpider(scrapy.spider.Spider):
    name = 'baidu'
    start_urls = [
        'http://giphy.com/categories'
    ]
    domain = 'http://giphy.com'

    def parse(self, response):
        selector = Selector(response)
        topCategorys = selector.xpath('//div[@id="None-list"]/a')
        items = []
        for tc in topCategorys:
            item = TutorialItem()
            item['catname'] = tc.xpath('./text()').extract()[0]
            item['caturl'] = tc.xpath('./@href').extract()[0]
            if item['catname'] == u'ALL':
                continue
            reqUrl = self.domain + '/' + item['caturl']
            yield Request(url=reqUrl, meta={'caturl': reqUrl},
                          callback=self.getSecondCategory)

    def getSecondCategory(self, response):
        selector = Selector(response)
        secondCategorys = selector.xpath(
            '//div[@class="grid_9 omega featured-category-tags"]/div/a')
        items = []
        for sc in secondCategorys:
            item = TutorialItem()
            item['catname'] = sc.xpath('./div/h4/text()').extract()[0]
            item['caturl'] = sc.xpath('./@href').extract()[0]
            items.append(item)
            reqUrl = self.domain + item['caturl']
        yield Request(url=reqUrl, meta={'caturl': reqUrl},
                      callback=self.getImages)

    def getImages(self, response):
        selector = Selector(response)
        images = selector.xpath('//*[contains(@class, "hoverable-gif")]')
        items = []
        for image in images:
            item = TutorialItem()
            item['image_urls'] = image.xpath('./a/figure/img/@src').extract()[0]
            items.append(item)
        return items
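One detail worth flagging in `getImages()`: `item['image_urls']` is assigned `extract()[0]`, a bare string. `ImagesPipeline.get_media_requests()` iterates over `image_urls`, and iterating a string yields individual characters, not URLs, so each character would become a (broken) request. The field needs to be a list:

```python
# image_urls must be a list of URLs. Iterating a bare string yields
# characters, so the pipeline would try to request 'h', 't', 't', 'p', ...
url = 'http://media1.giphy.com/media/2BDLDXFaEiuBy/200_s.gif'

chars = [c for c in url][:4]
assert chars == ['h', 't', 't', 'p']   # what iterating the string gives you

image_urls = [url]                     # correct: a one-element list
assert image_urls[0] == url
```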
Also, there are no errors in the output, which looks like this:
2014-12-21 13:49:56+0800 [scrapy] INFO: Enabled item pipelines: TutorialPipeline
2014-12-21 13:49:56+0800 [baidu] INFO: Spider opened
2014-12-21 13:49:56+0800 [baidu] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2014-12-21 13:49:56+0800 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6023
2014-12-21 13:49:56+0800 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080
2014-12-21 13:50:07+0800 [baidu] DEBUG: Crawled (200) <GET http://giphy.com/categories> (referer: None)
2014-12-21 13:50:08+0800 [baidu] DEBUG: Crawled (200) <GET http://giphy.com//categories/science/> (referer: http://giphy.com/categories)
2014-12-21 13:50:08+0800 [baidu] DEBUG: Crawled (200) <GET http://giphy.com//categories/sports/> (referer: http://giphy.com/categories)
2014-12-21 13:50:08+0800 [baidu] DEBUG: Crawled (200) <GET http://giphy.com//categories/news-politics/> (referer: http://giphy.com/categories)
2014-12-21 13:50:09+0800 [baidu] DEBUG: Crawled (200) <GET http://giphy.com//categories/transportation/> (referer: http://giphy.com/categories)
2014-12-21 13:50:09+0800 [baidu] DEBUG: Crawled (200) <GET http://giphy.com//categories/interests/> (referer: http://giphy.com/categories)
2014-12-21 13:50:09+0800 [baidu] DEBUG: Crawled (200) <GET http://giphy.com//categories/memes/> (referer: http://giphy.com/categories)
2014-12-21 13:50:09+0800 [baidu] DEBUG: Crawled (200) <GET http://giphy.com//categories/tv/> (referer: http://giphy.com/categories)
2014-12-21 13:50:09+0800 [baidu] DEBUG: Crawled (200) <GET http://giphy.com//categories/gaming/> (referer: http://giphy.com/categories)
2014-12-21 13:50:10+0800 [baidu] DEBUG: Crawled (200) <GET http://giphy.com//categories/nature/> (referer: http://giphy.com/categories)
2014-12-21 13:50:10+0800 [baidu] DEBUG: Crawled (200) <GET http://giphy.com//categories/emotions/> (referer: http://giphy.com/categories)
2014-12-21 13:50:10+0800 [baidu] DEBUG: Crawled (200) <GET http://giphy.com//categories/movies/> (referer: http://giphy.com/categories)
2014-12-21 13:50:10+0800 [baidu] DEBUG: Crawled (200) <GET http://giphy.com//categories/holiday/> (referer: http://giphy.com/categories)
2014-12-21 13:50:11+0800 [baidu] DEBUG: Crawled (200) <GET http://giphy.com//categories/reactions/> (referer: http://giphy.com/categories)
2014-12-21 13:50:11+0800 [baidu] DEBUG: Crawled (200) <GET http://giphy.com//categories/music/> (referer: http://giphy.com/categories)
2014-12-21 13:50:11+0800 [baidu] DEBUG: Crawled (200) <GET http://giphy.com//categories/decades/> (referer: http://giphy.com/categories)
2014-12-21 13:50:12+0800 [baidu] DEBUG: Crawled (200) <GET http://giphy.com/search/the-colbert-report/> (referer: http://giphy.com//categories/news-politics/)
2014-12-21 13:50:12+0800 [baidu] DEBUG: Scraped from <200 http://giphy.com/search/the-colbert-report/>
{'image_urls': u'http://media1.giphy.com/media/2BDLDXFaEiuBy/200_s.gif'}
2014-12-21 13:50:12+0800 [baidu] DEBUG: Scraped from <200 http://giphy.com/search/the-colbert-report/>
{'image_urls': u'http://media2.giphy.com/media/WisjAI5QGgsrC/200_s.gif'}
2014-12-21 13:50:12+0800 [baidu] DEBUG: Scraped from <200 http://giphy.com/search/the-colbert-report/>
{'image_urls': u'http://media3.giphy.com/media/ZgDGEMihlZXCo/200_s.gif'}
.............
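One detail visible in the log above: every category URL is crawled with a doubled slash (`http://giphy.com//categories/...`), because the spider concatenates `self.domain + '/' + item['caturl']` while the extracted `href` already starts with `/`. A small sketch of the problem and the usual fix with `urljoin` (assuming `caturl` is a root-relative path, as the log suggests):

```python
from urllib.parse import urljoin  # Python 3; on Python 2, urlparse.urljoin

base = 'http://giphy.com'
caturl = '/categories/tv/'  # hrefs on the page already start with '/'

# Naive concatenation reproduces the doubled slash seen in the log:
assert base + '/' + caturl == 'http://giphy.com//categories/tv/'

# urljoin normalises the path correctly:
assert urljoin(base, caturl) == 'http://giphy.com/categories/tv/'
```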
In my opinion, you don't need to override ImagesPipeline at all, because you are not modifying its behavior. But since you are overriding it, you should do it properly. When overriding ImagesPipeline, two methods should be overridden:

get_media_requests(item, info) should return a Request for every URL in image_urls. That part you did correctly.

item_completed(results, item, info) is called when all the image requests for a single item have completed (finished downloading, or failed for some reason). From the official documentation:

So, for your custom images pipeline to work, you need to override the item_completed() method, like this:

Additionally, there are other problems in your code that keep it from working as expected:

You don't actually create any useful items. If you look at your parse() and getSecondCategory() functions, you will notice that you never return or yield any items. Although you prepare an items list, apparently intending to use it to store items, it is never used to actually pass the items along to the next step of processing. At one point you simply yield a Request for the next page, and when the function finishes, items is discarded.

You are not using the caturl information that you pass through the meta dictionary. You pass it in both callbacks, but you never read it in the receiving function, so it is ignored as well.

So the only thing that basically works is the images pipeline, if you fix it as I've already suggested. To fix these problems in your code, follow the guidelines below (keep in mind this is not tested; it is only a guideline for you to consider):
^{pr2}$
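Stripped of Scrapy specifics, the core of that guideline is: a callback must yield (or return) its items, because a local list that is never returned is simply discarded when the function ends. A minimal, framework-free sketch of the pattern (names are mine, for illustration only):

```python
def callback_like(responses):
    """Mimics a Scrapy callback: every item is yielded so the engine
    can route it into the item pipelines, instead of being appended
    to a local list that dies with the function."""
    for r in responses:
        item = {'image_urls': [r]}   # note: a list, as ImagesPipeline expects
        yield item

items = list(callback_like(['http://example.com/a.gif']))
assert items == [{'image_urls': ['http://example.com/a.gif']}]
```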