Scrapy crawler not seeing any items scraped

Posted 2024-10-06 10:24:53

I'm interested in pulling some URLs out of my database so that they get crawled first the next time I re-crawl.

I already wrote a custom middleware to do this, but a bug with JOBDIR keeps the middleware approach from working (see the thread here).

So I decided to write my own custom scheduler instead, since that is where Scrapy gets all of its requests.
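Scrapy picks up a replacement scheduler through the SCHEDULER setting, so the wiring looks roughly like this (the dotted path below is a made-up example, not my real project layout):

# settings.py -- tell Scrapy to use the custom scheduler
# ('myproject.scheduler.GhostScheduler' is a hypothetical dotted path)
SCHEDULER = 'myproject.scheduler.GhostScheduler'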

Here is my custom queue:

from scrapy import log  # pre-1.0 Scrapy logging API (used by log.msg below)
from scrapy.http import Request


class GhostQ(object):
  """
  A Queue of Requests for ghost objects from our last crawl.
  -Can only be popped.
  -Basically a generator with length property
  -Does not include current ghost objects in the scheduler

  A "ghost" object is defined to be an object persisted into our db from crawling that has
  not yet been visited. They exist because we need to persist relationships and the only way
  we can do so is by persisting a blank object first.

  Entities that we have to check:
    -Looks, check if image_url is null
    -Clothing, check if name, brand, type is null
    -Model, check if name is null
  """
  def __init__(self, glooks_cursor, gclothing_cursor, gmodels_cursor, priority=5, yield_per=100):
    self._length = None
    self._generator = None

    self.PRIORITY = priority
    self.YIELD_PER = yield_per

    self.glooks_cursor = glooks_cursor
    self.gclothing_cursor = gclothing_cursor
    self.gmodels_cursor = gmodels_cursor

  def _init_length(self):
    total = self.glooks_cursor.count() + self.gmodels_cursor.count() \
        + self.gclothing_cursor.count()
    self._length = total
    log.msg("GhostQ has %d objects" % self._length)

  def __len__(self):
    if self._length is None:
      self._init_length()
    return self._length

  def _init_generator(self):
    """The use of all here allows us to retrieve everything at once.

    TODO (Somehow this is not breaking??!?)
      -yield_per should be breaking because we are also committing
      -Perhaps we need our own session here?
       i.e. new_session = scoped_session(sessionmaker(autocommit=False, autoflush=True, bind=engine))
    """
    for look in self.glooks_cursor.yield_per(self.YIELD_PER):
      yield Request(look.url, priority=self.PRIORITY)

    for clothing in self.gclothing_cursor.yield_per(self.YIELD_PER):
      yield Request(clothing.look.url, priority=self.PRIORITY)

    for model in self.gmodels_cursor.yield_per(self.YIELD_PER):
      yield Request(model.url, priority=self.PRIORITY)

  def pop(self):
    if self._generator is None:
      self._generator = self._init_generator()
    try:
      # next() works on both Python 2 and 3; .next() is Python 2 only
      request = next(self._generator)
      if self._length is None:
        self._init_length()
      self._length -= 1
      return request
    except StopIteration:
      return None
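Both count() and yield_per() are SQLAlchemy Query methods, so the three cursors are presumably unexecuted Query objects. A sketch of how the queue might be constructed, where Look, Clothing, Model and session are assumptions inferred from the docstring:

# Sketch only: the Look/Clothing/Model models and `session` are assumed
# from the docstring; any SQLAlchemy Query exposing count()/yield_per() works.
glooks_cursor = session.query(Look).filter(Look.image_url == None)
gclothing_cursor = session.query(Clothing).filter(
    (Clothing.name == None) | (Clothing.brand == None) | (Clothing.type == None))
gmodels_cursor = session.query(Model).filter(Model.name == None)

ghostq = GhostQ(glooks_cursor, gclothing_cursor, gmodels_cursor)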

And here is my custom scheduler:

[scheduler code block missing; only a formatting placeholder survived in the archived post]
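Since the scheduler code did not survive, here is only a guess at its shape: a minimal sketch that subclasses Scrapy's default scheduler (scrapy.core.scheduler.Scheduler, as it looked in pre-1.0 releases) and drains GhostQ before the normal queues. GhostScheduler and build_ghostq() are hypothetical names, not the original code:

from scrapy.core.scheduler import Scheduler


class GhostScheduler(Scheduler):
  """Sketch: serve requests from GhostQ first, then fall back to the
  default memory/disk queues."""

  def open(self, spider):
    # build_ghostq() is a hypothetical helper that creates the three
    # SQLAlchemy cursors and wraps them in a GhostQ (see sketch above)
    self.ghostq = build_ghostq()
    return super(GhostScheduler, self).open(spider)

  def next_request(self):
    request = self.ghostq.pop()  # the line the question toggles below
    if request is not None:
      return request
    return super(GhostScheduler, self).next_request()

  def __len__(self):
    # count ghost requests so has_pending_requests() sees them
    return len(self.ghostq) + super(GhostScheduler, self).__len__()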

When I run with this scheduler, I don't see anything getting scraped...

Here is the log.

However, if I disable this line: request = self.ghostq.pop(), I see items getting scraped again.

This is a really strange bug and I can't seem to figure out why. I initially suspected the dupefilter was the culprit, but then realized the dupefilter only filters requests, not objects.
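For what it's worth, requests handed out directly from next_request never pass through the dupefilter (it is consulted in enqueue_request), so one cheap way to rule it out completely is to build the ghost requests with dont_filter=True inside _init_generator. A one-line variant of the yield above, using only the standard Request argument:

# Hypothetical diagnostic: if the ghost URLs are ever re-submitted through
# the normal enqueue path (e.g. from a spider callback), dont_filter=True
# takes the dupefilter out of the picture entirely.
yield Request(look.url, priority=self.PRIORITY, dont_filter=True)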

Here is the log if I disable that line of code.

