Using Scrapy as an item generator

Published 2024-09-30 08:32:41


I have an existing script (main.py) that needs to collect data.

I started a Scrapy project to retrieve that data. Now, is there a way for main.py to consume the data from Scrapy as an item generator, instead of persisting the items with an item pipeline?

Something like this would be really convenient, but I don't know how to do it, or whether it is feasible at all:

for item in scrapy.process():

I found a potential solution here: https://tryolabs.com/blog/2011/09/27/calling-scrapy-python-script/, using a multithreaded queue.

I understand that this behaviour does not fit with distributed crawling, which is what Scrapy is meant for, but I am still a bit surprised that this feature is not available for smaller projects.


2 Answers

It is possible to do this from a Twisted or Tornado application:

import collections

from twisted.internet.defer import Deferred
from scrapy.crawler import Crawler
from scrapy import signals


def scrape_items(crawler_runner, crawler_or_spidercls, *args, **kwargs):
    """
    Start a crawl and return an object (an ItemCursor instance)
    which allows you to retrieve scraped items and wait for new
    items to become available.

    Example:

    .. code-block:: python

        @inlineCallbacks
        def f():
            runner = CrawlerRunner()
            async_items = scrape_items(runner, my_spider)
            while (yield async_items.fetch_next):
                item = async_items.next_item()
                # ...
            # ...

    This convoluted way to write a loop should become unnecessary
    in Python 3.5 because of ``async for``.
    """
    # this requires scrapy >= 1.1rc1
    crawler = crawler_runner.create_crawler(crawler_or_spidercls)
    # for scrapy < 1.1rc1 the following code is needed:
    # crawler = crawler_or_spidercls
    # if not isinstance(crawler_or_spidercls, Crawler):
    #    crawler = crawler_runner._create_crawler(crawler_or_spidercls)

    d = crawler_runner.crawl(crawler, *args, **kwargs)
    return ItemCursor(d, crawler)


class ItemCursor(object):
    def __init__(self, crawl_d, crawler):
        self.crawl_d = crawl_d
        self.crawler = crawler

        crawler.signals.connect(self._on_item_scraped, signals.item_scraped)

        crawl_d.addCallback(self._on_finished)
        crawl_d.addErrback(self._on_error)

        self.closed = False
        self._items_available = Deferred()
        self._items = collections.deque()

    def _on_item_scraped(self, item):
        self._items.append(item)
        self._items_available.callback(True)
        self._items_available = Deferred()

    def _on_finished(self, result):
        self.closed = True
        self._items_available.callback(False)

    def _on_error(self, failure):
        self.closed = True
        self._items_available.errback(failure)

    @property
    def fetch_next(self):
        """
        A Deferred used with ``inlineCallbacks`` or ``gen.coroutine`` to
        asynchronously retrieve the next item, waiting for an item to be
        crawled if necessary. Resolves to ``False`` if the crawl is finished,
        otherwise :meth:`next_item` is guaranteed to return an item
        (a dict or a scrapy.Item instance).
        """
        if self.closed:
            # crawl is finished
            d = Deferred()
            d.callback(False)
            return d

        if self._items:
            # result is ready
            d = Deferred()
            d.callback(True)
            return d

        # We're active, but item is not ready yet. Return a Deferred which
        # resolves to True if item is scraped or to False if crawl is stopped.
        return self._items_available

    def next_item(self):
        """Get a document from the most recently fetched batch, or ``None``.
        See :attr:`fetch_next`.
        """
        if not self._items:
            return None
        return self._items.popleft()

The main idea is to listen for the item_scraped signal and then wrap it in an object with a nicer API.

Note that you need an event loop running in your main.py script for this to work; the example above works with twisted.defer.inlineCallbacks or tornado's gen.coroutine.
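
For reference, here is a minimal sketch of what such a main.py could look like with Twisted, assuming scrape_items and MySpider are importable from your own modules (the import paths below are placeholders):

from twisted.internet import reactor
from twisted.internet.defer import inlineCallbacks
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging

# hypothetical import paths; adjust them to your project layout
from myproject.scrape_api import scrape_items
from myproject.spiders import MySpider


@inlineCallbacks
def consume_items():
    runner = CrawlerRunner()
    async_items = scrape_items(runner, MySpider)
    # fetch_next resolves to True when an item is ready, False when the crawl is over
    while (yield async_items.fetch_next):
        item = async_items.next_item()
        print(item)
    reactor.stop()


if __name__ == "__main__":
    configure_logging()
    reactor.callWhenRunning(consume_items)
    reactor.run()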

You can send JSON data from the crawler process and read the result back in the parent process. It can be done as follows:

Given the spider:

import json

import scrapy


class MySpider(scrapy.Spider):
    # some attributes
    accumulated = []

    def parse(self, response):
        # do your logic here
        page_text = response.xpath('//text()').extract()
        for text in page_text:
            if conditionsAreOk(text):
                self.accumulated.append(text)

    def closed(self, reason):
        # called when the crawl finishes
        print(json.dumps(self.accumulated))

Write a runner.py script along the following lines.

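A minimal sketch of what such a runner could look like, assuming MySpider is importable from myproject.spiders (a placeholder path) and that the URL to crawl is passed as the first command-line argument:

import sys

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

# hypothetical import path; adjust it to wherever MySpider is defined
from myproject.spiders import MySpider


if __name__ == "__main__":
    url = sys.argv[1]

    process = CrawlerProcess(get_project_settings())
    # start_urls is passed as a spider argument and becomes an attribute on the spider instance
    process.crawl(MySpider, start_urls=[url])
    # blocks until the crawl finishes; closed() then prints the accumulated JSON
    process.start()

Scrapy writes its own log to stderr by default, so the stdout captured by main.py should contain only the JSON printed by closed().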

Then call it from your main.py as:

import json
import subprocess
import sys
import time


def main(argv):
    # urlArray holds the http:// or https:// urls you want to crawl
    for url in urlArray:
        # run the spider in a separate process and capture its output
        p = subprocess.Popen(['python', 'runner.py', url],
                             stdout=subprocess.PIPE, stderr=subprocess.PIPE)
        out, err = p.communicate()

        # do something with your data
        print(out)
        print(json.loads(out))

        # this just helps to watch the logs
        time.sleep(0.5)


if __name__ == "__main__":
    main(sys.argv[1:])

Note

As you know, this is not the best way to use Scrapy, but for quick results that do not require complex post-processing, this solution provides what you need.

I hope it helps.
