Python + Scrapy: problem getting "ImagesPipeline" to run when the crawler is run from a script

Posted 2024-07-02 12:50:15


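I am new to Python, so I apologize if there is a silly mistake here. I have been searching the web for days, looking at similar questions and combing through the Scrapy documentation, but nothing seems to really solve this problem.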

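I have a Scrapy project that successfully scrapes the source website, returns the required items, and then uses an ImagesPipeline to download (and then rename accordingly) the images from the returned image links, but only when I run it from the terminal with "runspider".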

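Whenever I run the spider from the terminal with "crawl", or from a script with CrawlerProcess, it returns the items but does not download the images, and, I assume, it skips the image pipeline entirely.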

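I have read that when running this way I need to import the settings so that the pipeline loads correctly, which makes sense after looking at the differences between "crawl" and "runspider", but I still cannot get the pipeline to work.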

There are no error messages, but I did notice that it logs "[scrapy.middleware] INFO: Enabled item pipelines: []". I assume this indicates that it has not found my pipeline?

Here is my spider.py:

import scrapy
from scrapy2.items import Scrapy2Item
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings


class spider1(scrapy.Spider):
    name = "spider1"
    domain = "https://www.amazon.ca/s?k=821826022317"

    def start_requests(self):
        yield scrapy.Request(url=spider1.domain, callback=self.parse)

    def parse(self, response):
        items = Scrapy2Item()

        titlevar = response.css('span.a-text-normal ::text').extract_first()
        imgvar = [response.css('img ::attr(src)').extract_first()]
        skuvar = response.xpath('//meta[@name="keywords"]/@content')[0].extract()

        items['title'] = titlevar
        items['image_urls'] = imgvar
        items['sku'] = skuvar

        yield items


process = CrawlerProcess(get_project_settings())
process.crawl(spider1)
process.start()

Here is my items.py:

import scrapy


class Scrapy2Item(scrapy.Item):
    title = scrapy.Field()
    image_urls = scrapy.Field()
    sku = scrapy.Field()

Here is my pipelines.py:

import scrapy
from scrapy.pipelines.images import ImagesPipeline


class Scrapy2Pipeline(ImagesPipeline):
    def get_media_requests(self, item, info):
        return [scrapy.Request(x, meta={'image_name': item['sku']})
                for x in item.get('image_urls', [])]

    def file_path(self, request, response=None, info=None):
        return '%s.jpg' % request.meta['image_name']

Here is my settings.py:

BOT_NAME = 'scrapy2'

SPIDER_MODULES = ['scrapy2.spiders']
NEWSPIDER_MODULE = 'scrapy2.spiders'

ROBOTSTXT_OBEY = True

ITEM_PIPELINES = {
    'scrapy2.pipelines.Scrapy2Pipeline': 1,
}

IMAGES_STORE = 'images'

Thanks to everyone who reads this or even tries to help me. It is much appreciated.


1 answer
User
Answer #1 · Posted 2024-07-02 12:50:15

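Because you are running the spider as a script, there is no Scrapy project environment, so get_project_settings will not work (apart from returning the default settings). The script must be self-contained, i.e. it must contain everything needed to run the spider (or import it from the Python search path, like any regular Python code).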

I have reformatted the code for you so that it runs when you execute it with a plain Python interpreter: python3 script.py.

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
import scrapy
from scrapy.pipelines.images import ImagesPipeline

BOT_NAME = 'scrapy2'
ROBOTSTXT_OBEY = True
IMAGES_STORE = 'images'


class Scrapy2Item(scrapy.Item):
    title = scrapy.Field()
    image_urls = scrapy.Field()
    sku = scrapy.Field()

class Scrapy2Pipeline(ImagesPipeline):
    def get_media_requests(self, item, info):
        return [scrapy.Request(x, meta={'image_name': item['sku']})
                for x in item.get('image_urls', [])]

    def file_path(self, request, response=None, info=None):
        return '%s.jpg' % request.meta['image_name']

class spider1(scrapy.Spider):
    name = "spider1"
    domain = "https://www.amazon.ca/s?k=821826022317"

    def start_requests(self):
        yield scrapy.Request(url=spider1.domain, callback=self.parse)

    def parse(self, response):

        items = Scrapy2Item()

        titlevar = response.css('span.a-text-normal ::text').extract_first()
        imgvar = [response.css('img ::attr(src)').extract_first()]
        skuvar = response.xpath('//meta[@name="keywords"]/@content')[0].extract()

        items['title'] = titlevar
        items['image_urls'] = imgvar
        items['sku'] = skuvar

        yield items

if __name__ == "__main__":
    from scrapy.crawler import CrawlerProcess
    from scrapy.settings import Settings

    settings = Settings(values={
        'BOT_NAME': BOT_NAME,
        'ROBOTSTXT_OBEY': ROBOTSTXT_OBEY,
        'ITEM_PIPELINES': {
            '__main__.Scrapy2Pipeline': 1,
        },
        'IMAGES_STORE': IMAGES_STORE,
        'TELNETCONSOLE_ENABLED': False,
    })

    process = CrawlerProcess(settings=settings)
    process.crawl(spider1)
    process.start()
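
The '__main__.Scrapy2Pipeline' key in the script above works because Scrapy treats each ITEM_PIPELINES entry as a dotted import path, and a file executed directly is importable under the module name __main__. The following stdlib-only sketch illustrates that kind of lookup. The function resolve_dotted_path is a hypothetical stand-in written for this example, not Scrapy's own loader, and collections.OrderedDict is used only as a convenient target.

```python
import collections
from importlib import import_module


def resolve_dotted_path(path):
    """Import the module part of 'pkg.module.Name' and return the attribute Name."""
    module_name, _, attr_name = path.rpartition(".")
    if not module_name:
        raise ValueError("expected a full dotted path, got %r" % path)
    module = import_module(module_name)
    return getattr(module, attr_name)


if __name__ == "__main__":
    cls = resolve_dotted_path("collections.OrderedDict")
    # The resolved object is the very same class as the directly imported one.
    print(cls is collections.OrderedDict)
```

By the same logic, a path such as 'scrapy2.pipelines.Scrapy2Pipeline' only resolves when the scrapy2 package is importable from where the script runs, which is why the single-file version points at __main__ instead.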
