Scrapy cannot download pictures


I am crawling a website with Scrapy (0.22). I need to do three things:

  1. Get the category and subcategory of each image
  2. Download the images and save them locally
  3. Store the category, subcategory, and image URL in Mongo

But right now I am stuck: I use a pipeline to download the images, yet my code does not work as expected and no images are saved locally.

Also, since I want to store the information in Mongo, can anyone give me some advice on the Mongo "table" (collection) structure?

My code is as follows:

settings.py

BOT_NAME = 'tutorial'

SPIDER_MODULES = ['tutorial.spiders']
NEWSPIDER_MODULE = 'tutorial.spiders'

ITEM_PIPELINES = {'tutorial.pipelines.TutorialPipeline': 1}
IMAGES_STORE = '/ttt'
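# Note (not in the original post): the images pipeline needs PIL/Pillow
# installed, and IMAGES_STORE must point to a directory the crawling process
# can create and write to; a root-level path such as '/ttt' may not be
# writable for an unprivileged user.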

items.py

The items.py code block was garbled in the original post; a minimal version consistent with the fields the spider and the pipeline use (catname, caturl, image_urls, plus image_paths, which the images pipeline fills in after downloading) would look like this:
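from scrapy.item import Item, Field


class TutorialItem(Item):
    catname = Field()      # top-level category name
    caturl = Field()       # category URL
    image_urls = Field()   # list of image URLs for the images pipeline to fetch
    image_paths = Field()  # local paths, filled in by the images pipeline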

pipelines.py

from scrapy.contrib.pipeline.images import ImagesPipeline
from scrapy.exceptions import DropItem
from scrapy.http import Request


class TutorialPipeline(ImagesPipeline):

    def get_media_requests(self, item, info):
        # issue one download request per URL in item['image_urls']
        for image_url in item['image_urls']:
            yield Request(image_url)

spider.py

import scrapy
from scrapy.http import Request
from scrapy.selector import Selector
from scrapy.spider import Spider

from tutorial.items import TutorialItem

class BaiduSpider(scrapy.spider.Spider):
    name='baidu'
    start_urls=[
        'http://giphy.com/categories'
    ]

    domain='http://giphy.com'

    def parse(self,response):
        selector=Selector(response)

        topCategorys=selector.xpath('//div[@id="None-list"]/a')

        items=[]
        for tc in topCategorys:
            item=TutorialItem()
            item['catname']=tc.xpath('./text()').extract()[0]
            item['caturl']=tc.xpath('./@href').extract()[0]
            if item['catname']==u'ALL':
                continue
            reqUrl=self.domain+'/'+item['caturl']
            yield Request(url=reqUrl,meta={'caturl':reqUrl},callback=self.getSecondCategory)

    def getSecondCategory(self,response):
        selector=Selector(response)

        secondCategorys=selector.xpath('//div[@class="grid_9 omega featured-category-tags"]/div/a')

        items=[]
        for sc in secondCategorys:
            item=TutorialItem()
            item['catname']=sc.xpath('./div/h4/text()').extract()[0]
            item['caturl']=sc.xpath('./@href').extract()[0]
            items.append(item)

            reqUrl=self.domain+item['caturl']
            yield Request(url=reqUrl,meta={'caturl':reqUrl},callback=self.getImages)

    def getImages(self,response):
        selector=Selector(response)

        images=selector.xpath('//*[contains (@class,"hoverable-gif")]')

        items=[]
        for image in images:
            item=TutorialItem()
            item['image_urls']=image.xpath('./a/figure/img/@src').extract()[0]
            items.append(item)

        return items

Also, there are no errors in the output, which looks like this:

2014-12-21 13:49:56+0800 [scrapy] INFO: Enabled item pipelines: TutorialPipeline
2014-12-21 13:49:56+0800 [baidu] INFO: Spider opened
2014-12-21 13:49:56+0800 [baidu] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2014-12-21 13:49:56+0800 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6023
2014-12-21 13:49:56+0800 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080
2014-12-21 13:50:07+0800 [baidu] DEBUG: Crawled (200) <GET http://giphy.com/categories> (referer: None)
2014-12-21 13:50:08+0800 [baidu] DEBUG: Crawled (200) <GET http://giphy.com//categories/science/> (referer: http://giphy.com/categories)
2014-12-21 13:50:08+0800 [baidu] DEBUG: Crawled (200) <GET http://giphy.com//categories/sports/> (referer: http://giphy.com/categories)
2014-12-21 13:50:08+0800 [baidu] DEBUG: Crawled (200) <GET http://giphy.com//categories/news-politics/> (referer: http://giphy.com/categories)
2014-12-21 13:50:09+0800 [baidu] DEBUG: Crawled (200) <GET http://giphy.com//categories/transportation/> (referer: http://giphy.com/categories)
2014-12-21 13:50:09+0800 [baidu] DEBUG: Crawled (200) <GET http://giphy.com//categories/interests/> (referer: http://giphy.com/categories)
2014-12-21 13:50:09+0800 [baidu] DEBUG: Crawled (200) <GET http://giphy.com//categories/memes/> (referer: http://giphy.com/categories)
2014-12-21 13:50:09+0800 [baidu] DEBUG: Crawled (200) <GET http://giphy.com//categories/tv/> (referer: http://giphy.com/categories)
2014-12-21 13:50:09+0800 [baidu] DEBUG: Crawled (200) <GET http://giphy.com//categories/gaming/> (referer: http://giphy.com/categories)
2014-12-21 13:50:10+0800 [baidu] DEBUG: Crawled (200) <GET http://giphy.com//categories/nature/> (referer: http://giphy.com/categories)
2014-12-21 13:50:10+0800 [baidu] DEBUG: Crawled (200) <GET http://giphy.com//categories/emotions/> (referer: http://giphy.com/categories)
2014-12-21 13:50:10+0800 [baidu] DEBUG: Crawled (200) <GET http://giphy.com//categories/movies/> (referer: http://giphy.com/categories)
2014-12-21 13:50:10+0800 [baidu] DEBUG: Crawled (200) <GET http://giphy.com//categories/holiday/> (referer: http://giphy.com/categories)
2014-12-21 13:50:11+0800 [baidu] DEBUG: Crawled (200) <GET http://giphy.com//categories/reactions/> (referer: http://giphy.com/categories)
2014-12-21 13:50:11+0800 [baidu] DEBUG: Crawled (200) <GET http://giphy.com//categories/music/> (referer: http://giphy.com/categories)
2014-12-21 13:50:11+0800 [baidu] DEBUG: Crawled (200) <GET http://giphy.com//categories/decades/> (referer: http://giphy.com/categories)
2014-12-21 13:50:12+0800 [baidu] DEBUG: Crawled (200) <GET http://giphy.com/search/the-colbert-report/> (referer: http://giphy.com//categories/news-politics/)
2014-12-21 13:50:12+0800 [baidu] DEBUG: Scraped from <200 http://giphy.com/search/the-colbert-report/>
    {'image_urls': u'http://media1.giphy.com/media/2BDLDXFaEiuBy/200_s.gif'}
2014-12-21 13:50:12+0800 [baidu] DEBUG: Scraped from <200 http://giphy.com/search/the-colbert-report/>
    {'image_urls': u'http://media2.giphy.com/media/WisjAI5QGgsrC/200_s.gif'}
2014-12-21 13:50:12+0800 [baidu] DEBUG: Scraped from <200 http://giphy.com/search/the-colbert-report/>
    {'image_urls': u'http://media3.giphy.com/media/ZgDGEMihlZXCo/200_s.gif'}
.............

1 Answer

It seems to me that you do not need to override ImagesPipeline at all, since you are not modifying its behavior. But since you are doing it, you should do it properly. When overriding ImagesPipeline, two methods should be overridden:

  • get_media_requests(item, info) should return a Request for every URL in image_urls. You are doing this part correctly.

  • item_completed(results, item, info) is called when all image requests for a single item have completed (either finished downloading, or failed for some reason). From the official documentation:

    The item_completed() method must return the output that will be sent to subsequent item pipeline stages, so you must return (or drop) the item, as you would in any pipeline.

So, to make your custom images pipeline work, you need to override the item_completed() method, like this:

def item_completed(self, results, item, info):
    # keep only the paths of the successfully downloaded images
    image_paths = [x['path'] for ok, x in results if ok]
    if not image_paths:
        raise DropItem("Item contains no images")
    item['image_paths'] = image_paths
    return item
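For reference, each entry in results is a two-element tuple (success, image_info): on success, image_info is a dict with 'url', 'path' (relative to IMAGES_STORE), and 'checksum' keys; on failure it is the failure that occurred.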

In addition, here are the other problems in your code that keep it from working as expected:

  1. You are not actually creating any useful items.
    If you look at your parse() and getSecondCategory() functions, you will notice that you never return or yield any items. Although you prepare an items list, apparently intending to use it to store your items, it is never used to actually pass the items on to the next step of the processing path. At one point you simply yield a Request for the next page, and when the function finishes, your items are discarded.

  2. You are not using the caturl information that you pass through the meta dictionary. You pass it in both parse() and getSecondCategory(), but you never read it in the callback functions, so it is ignored as well.

So the only thing that basically works is the images pipeline, once you fix it as suggested above. To fix these issues in your code, follow the guidelines below (bear in mind that this is not tested; it is only a guideline for you to consider):

The code block for these guidelines was garbled in the original post; below is an untested sketch along the lines described above. It assumes a subcatname field is added to TutorialItem to hold the subcategory; every image now becomes a yielded item, the category information travels through meta, and image_urls stays a list (extract() rather than extract()[0]):
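from scrapy.http import Request
from scrapy.selector import Selector
from scrapy.spider import Spider

from tutorial.items import TutorialItem


class BaiduSpider(Spider):
    name = 'baidu'
    start_urls = ['http://giphy.com/categories']
    domain = 'http://giphy.com'

    def parse(self, response):
        selector = Selector(response)
        for tc in selector.xpath('//div[@id="None-list"]/a'):
            catname = tc.xpath('./text()').extract()[0]
            if catname == u'ALL':
                continue
            caturl = tc.xpath('./@href').extract()[0]
            # carry the category name along so later callbacks can use it
            yield Request(url=self.domain + '/' + caturl,
                          meta={'catname': catname},
                          callback=self.getSecondCategory)

    def getSecondCategory(self, response):
        selector = Selector(response)
        for sc in selector.xpath('//div[@class="grid_9 omega featured-category-tags"]/div/a'):
            subcatname = sc.xpath('./div/h4/text()').extract()[0]
            subcaturl = self.domain + sc.xpath('./@href').extract()[0]
            yield Request(url=subcaturl,
                          meta={'catname': response.meta['catname'],
                                'subcatname': subcatname,  # assumes an extra Field
                                'caturl': subcaturl},
                          callback=self.getImages)

    def getImages(self, response):
        selector = Selector(response)
        for image in selector.xpath('//*[contains (@class,"hoverable-gif")]'):
            item = TutorialItem()
            item['catname'] = response.meta['catname']
            item['subcatname'] = response.meta['subcatname']
            item['caturl'] = response.meta['caturl']
            # image_urls must be a LIST of URLs: extract(), not extract()[0]
            item['image_urls'] = image.xpath('./a/figure/img/@src').extract()
            yield item

With the items now yielded from getImages(), the fixed images pipeline will receive them, download every URL in image_urls, and attach the resulting image_paths; storing the item in Mongo can then happen in a further pipeline stage.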
