<p>The answer by Maxime Lorant finally solved my problem of building an ugly spider around my own script. It solves two issues I had:</p>
<ol>
<li><p>It allows calling the spider twice in a row (with the simple example from the Scrapy tutorial this crashes, because you cannot start the Twisted reactor twice; see the sketch below).</p></li>
<li><p>It allows returning variables from the spider back to the script.</p></li>
</ol>
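<p>For reference, this is the kind of naive back-to-back sequence (modelled on the tutorial's "run a spider from a script" example, using the <code>QuotesSpider</code> and <code>url_list</code> from the full script below) that crashes for me:</p>
<pre><code>from scrapy.crawler import CrawlerProcess

process = CrawlerProcess()
process.crawl(QuotesSpider, urls=url_list[:2])
process.start()   # starts the Twisted reactor and blocks until the crawl finishes

process.crawl(QuotesSpider, urls=url_list[2:])
process.start()   # fails with twisted.internet.error.ReactorNotRestartable
</code></pre>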
<p>Just one thing: the example does not work with the Scrapy version I am currently using (Scrapy 1.5.2) together with Python 3.7.</p>
<p>After playing with the code for a bit I got a working example, which I would like to share. I still have one question though, see below the script. It is a standalone script, so I have included a spider as well.</p>
<pre><code>import logging
import multiprocessing as mp

import scrapy
from scrapy.crawler import CrawlerProcess
from scrapy.signals import item_passed
from scrapy.utils.project import get_project_settings
from scrapy.xlib.pydispatch import dispatcher


class CrawlerWorker(mp.Process):
    """Run one crawl in its own process, so the Twisted reactor starts
    fresh every time; scraped items are sent back through result_queue."""
    name = "crawlerworker"

    def __init__(self, spider, result_queue):
        mp.Process.__init__(self)
        self.result_queue = result_queue
        self.items = list()
        self.spider = spider
        self.logger = logging.getLogger(self.name)

        self.settings = get_project_settings()
        self.logger.setLevel(logging.DEBUG)
        self.logger.debug("Create CrawlerProcess with settings {}".format(self.settings))
        self.crawler = CrawlerProcess(self.settings)

        # collect every scraped item via the (deprecated) pydispatch dispatcher
        dispatcher.connect(self._item_passed, item_passed)

    def _item_passed(self, item):
        self.logger.debug("Adding Item {} to {}".format(item, self.items))
        self.items.append(item)

    def run(self):
        self.logger.info("Start here with {}".format(self.spider.urls))
        self.crawler.crawl(self.spider, urls=self.spider.urls)
        self.crawler.start()
        self.crawler.stop()
        self.result_queue.put(self.items)


class QuotesSpider(scrapy.Spider):
    name = "quotes"

    def __init__(self, **kw):
        super(QuotesSpider, self).__init__(**kw)
        self.urls = kw.get("urls", [])

    def start_requests(self):
        for url in self.urls:
            yield scrapy.Request(url=url, callback=self.parse)
        else:
            self.log('Nothing to scrape. Please pass the urls')

    def parse(self, response):
        """ Count number of The's on the page """
        the_count = len(response.xpath("//body//text()").re(r"The\s"))
        self.log("found {} time 'The'".format(the_count))
        yield {response.url: the_count}


def report_items(message, item_list):
    print(message)
    if item_list:
        for cnt, item in enumerate(item_list):
            print("item {:2d}: {}".format(cnt, item))
    else:
        print("No items found")


url_list = [
    'http://quotes.toscrape.com/page/1/',
    'http://quotes.toscrape.com/page/2/',
    'http://quotes.toscrape.com/page/3/',
    'http://quotes.toscrape.com/page/4/',
]

result_queue1 = mp.Queue()
crawler = CrawlerWorker(QuotesSpider(urls=url_list[:2]), result_queue1)
crawler.start()
# wait until we are done with the crawl
crawler.join()

# crawl again
result_queue2 = mp.Queue()
crawler = CrawlerWorker(QuotesSpider(urls=url_list[2:]), result_queue2)
crawler.start()
crawler.join()

report_items("First result", result_queue1.get())
report_items("Second result", result_queue2.get())
</code></pre>
<p>As you can see, the code is almost identical; only some imports have changed due to changes in the Scrapy API.</p>
<p>One thing: I get a deprecation warning for the pydispatch import:</p>
<pre><code>ScrapyDeprecationWarning: Importing from scrapy.xlib.pydispatch is deprecated and will no longer be supported in future Scrapy versions. If you just want to connect signals use the from_crawler class method, otherwise import pydispatch directly if needed.
  from scrapy.xlib.pydispatch import dispatcher
</code></pre>
<p>I found <a href="https://github.com/scrapy/scrapy/issues/2959#issuecomment-335656753" rel="nofollow noreferrer">here</a> how this could be solved. However, I could not get it to work. Does anybody know how to apply the from_crawler class method to get rid of the deprecation warning?</p>
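<p>For completeness, this is how I read the suggestion from the linked issue, as far as I understand it: connect the handler to the crawler's signal manager inside a <code>from_crawler</code> classmethod on the spider, instead of importing the global pydispatch dispatcher. The handler name <code>item_scraped_handler</code> and the <code>collected_items</code> list are my own names, and I have not managed to verify this in my setup:</p>
<pre><code>import scrapy
from scrapy import signals


class QuotesSpider(scrapy.Spider):
    name = "quotes"

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        # let Scrapy build the spider as usual, then attach the handler to the
        # crawler's signal manager instead of the global pydispatch dispatcher
        spider = super(QuotesSpider, cls).from_crawler(crawler, *args, **kwargs)
        crawler.signals.connect(spider.item_scraped_handler,
                                signal=signals.item_scraped)
        return spider

    def __init__(self, urls=None, **kw):
        super(QuotesSpider, self).__init__(**kw)
        self.urls = urls or []
        self.collected_items = []

    def item_scraped_handler(self, item, response, spider):
        # called once for every item the engine has scraped
        self.collected_items.append(item)

    def start_requests(self):
        for url in self.urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        the_count = len(response.xpath("//body//text()").re(r"The\s"))
        yield {response.url: the_count}
</code></pre>
<p>If I read the docs correctly, <code>from_crawler</code> is only used when the spider class (rather than a ready-made spider object) is passed to <code>crawl()</code>, so the worker would also have to call <code>self.crawler.crawl(QuotesSpider, urls=...)</code> and collect the items from the spider instance instead of from a dispatcher callback.</p>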