Redis-based components for Scrapy.

Detailed description of the scrapy-redis Python project


Scrapy-Redis


Redis-based components for Scrapy.

  • Free software: MIT license
  • Documentation: https://scrapy-redis.readthedocs.org
  • Python versions: 2.7, 3.4+

Features

  • Distributed crawling/scraping

    You can start multiple spider instances that share a single redis queue.
    Best suited for broad multi-domain crawls.

  • Distributed post-processing

    Scraped items get pushed into a redis queue, meaning that you can start as
    many post-processing processes as needed, all sharing the items queue.

  • Scrapy plug-and-play components

    Scheduler + Duplication Filter, Item Pipeline, Base Spiders.
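The distributed post-processing feature above can be sketched with a minimal consumer. This is an illustrative sketch, not part of scrapy-redis itself; the key name `myspider:items` and the `decode_item`/`consume` helpers are assumptions, following the default `REDIS_ITEMS_KEY` pattern and the fact that the default pipeline serializer emits JSON.

```python
import json

# Matches the default REDIS_ITEMS_KEY pattern, '%(spider)s:items',
# for a spider named 'myspider'.
ITEMS_KEY = 'myspider:items'

def decode_item(raw):
    """Items are stored as JSON by the default ScrapyJSONEncoder."""
    return json.loads(raw)

def consume(r, key=ITEMS_KEY):
    """Block until an item is available in redis, then return it decoded."""
    _, raw = r.blpop(key)  # blpop returns a (key, value) pair
    return decode_item(raw)

if __name__ == "__main__":
    # Hypothetical standalone consumer; any number of these can run in
    # parallel against the same queue.
    import redis
    r = redis.StrictRedis(host='localhost', port=6379)
    while True:
        print(consume(r))
```

Because `blpop` atomically pops one element, multiple consumers never see the same item twice.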

Requirements

  • Python 2.7, 3.4 or 3.5
  • Redis >= 2.8
  • Scrapy >= 1.0
  • redis-py >= 2.10

Usage

Use the following settings in your project:

```python
# Enables scheduling storing requests queue in redis.
SCHEDULER = "scrapy_redis.scheduler.Scheduler"

# Ensure all spiders share same duplicates filter through redis.
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"

# Default requests serializer is pickle, but it can be changed to any module
# with loads and dumps functions. Note that pickle is not compatible between
# python versions.
# Caveat: In python 3.x, the serializer must return strings keys and support
# bytes as values. Because of this reason the json or msgpack module will not
# work by default. In python 2.x there is no such issue and you can use
# 'json' or 'msgpack' as serializers.
#SCHEDULER_SERIALIZER = "scrapy_redis.picklecompat"

# Don't cleanup redis queues, allows to pause/resume crawls.
#SCHEDULER_PERSIST = True

# Schedule requests using a priority queue. (default)
#SCHEDULER_QUEUE_CLASS = 'scrapy_redis.queue.PriorityQueue'

# Alternative queues.
#SCHEDULER_QUEUE_CLASS = 'scrapy_redis.queue.FifoQueue'
#SCHEDULER_QUEUE_CLASS = 'scrapy_redis.queue.LifoQueue'

# Max idle time to prevent the spider from being closed when distributed crawling.
# This only works if queue class is SpiderQueue or SpiderStack,
# and may also block the same time when your spider start at the first time
# (because the queue is empty).
#SCHEDULER_IDLE_BEFORE_CLOSE = 10

# Store scraped item in redis for post-processing.
ITEM_PIPELINES = {
    'scrapy_redis.pipelines.RedisPipeline': 300,
}

# The item pipeline serializes and stores the items in this redis key.
#REDIS_ITEMS_KEY = '%(spider)s:items'

# The items serializer is by default ScrapyJSONEncoder. You can use any
# importable path to a callable object.
#REDIS_ITEMS_SERIALIZER = 'json.dumps'

# Specify the host and port to use when connecting to Redis (optional).
#REDIS_HOST = 'localhost'
#REDIS_PORT = 6379

# Specify the full Redis URL for connecting (optional).
# If set, this takes precedence over the REDIS_HOST and REDIS_PORT settings.
#REDIS_URL = 'redis://user:pass@hostname:9001'

# Custom redis client parameters (i.e.: socket timeout, etc.)
#REDIS_PARAMS = {}
# Use custom redis client class.
#REDIS_PARAMS['redis_cls'] = 'myproject.RedisClient'

# If True, it uses redis' ``spop`` operation. This could be useful if you
# want to avoid duplicates in your start urls list. In this case, urls must
# be added via ``sadd`` command or you will get a type error from redis.
#REDIS_START_URLS_AS_SET = False

# Default start urls key for RedisSpider and RedisCrawlSpider.
#REDIS_START_URLS_KEY = '%(name)s:start_urls'

# Use other encoding than utf-8 for redis.
#REDIS_ENCODING = 'latin1'
```
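The serializer caveat noted in the settings above can be demonstrated directly: on Python 3 the default pickle serializer round-trips the bytes values found in encoded requests, while json does not. The `request_dict` below is a simplified stand-in for a real encoded Scrapy request, not the actual structure scrapy-redis stores.

```python
import json
import pickle

# A Scrapy request, once encoded, contains bytes values (e.g. the body).
request_dict = {"url": "http://example.com", "method": "GET", "body": b""}

# pickle (the default serializer) handles bytes values fine:
data = pickle.dumps(request_dict)
assert pickle.loads(data) == request_dict

# json does not: bytes values raise a TypeError on Python 3.
try:
    json.dumps(request_dict)
    json_ok = True
except TypeError:
    json_ok = False
assert not json_ok
```

This is why `REDIS_SERIALIZER` alternatives such as `json` or `msgpack` work on Python 2 but not, by default, on Python 3.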
Note: Version 0.3 changed the requests serialization from marshal to cPickle,
therefore persisted requests made with a previous version will not work with
version 0.3.
