<p>I use the <a href="http://doc.scrapy.org/en/latest/topics/extensions.html" rel="nofollow noreferrer">Scrapy Extensions</a> approach to extend the Spider class into a class named Masterspider, which includes a generic parser.</p>
<p>Below is the "short" version of my generic extension parser. Note that as soon as you start working with pages rendered via AJAX, you will need to implement a renderer backed by a JavaScript engine (such as <a href="https://stackoverflow.com/a/17979285">Selenium</a>). There is also quite a lot of extra code to manage differences between sites (scraping based on column titles, handling relative vs. full URLs, managing different kinds of data containers, etc.).</p>
<p>What is interesting about the Scrapy Extensions approach is that you can still override the generic parser methods if something does not fit, although I have never had to. The Masterspider class checks whether certain methods have been defined on the site-specific spider class (e.g. parse_start, next_url_parser...), which makes it possible to handle site specifics: sending a form, building the next_url request from elements on the page, and so on.</p>
<p>As I scrape very different websites, there are always site-specific details to deal with. That is why I prefer to keep one class per scraped site, so that I can write specific methods to handle it (pre-/post-processing beyond what pipelines and request generators cover).</p>
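<p>The optional-hook dispatch described above can be sketched independently of Scrapy. This is a minimal illustration with hypothetical class and method names (only <code>parse_start</code> comes from the original code): the base class checks with <code>hasattr</code> whether the subclass provides a site-specific hook and otherwise falls back to the generic method.</p>
<pre><code># Minimal sketch (plain Python, no Scrapy) of the optional-hook pattern.
# Class names here are invented for illustration; only parse_start
# mirrors the hook name used by MasterSpider.
class MasterParser(object):

    def first_callback(self):
        # Prefer a site-specific parse_start if the subclass defines one,
        # otherwise fall back to the generic parse method.
        return self.parse_start if hasattr(self, 'parse_start') else self.parse

    def parse(self):
        return "generic parse"

class PlainSite(MasterParser):
    pass  # no hooks: inherits the generic behaviour

class AjaxSite(MasterParser):
    def parse_start(self):  # site-specific first-page parser
        return "site-specific first-page parse"

print(PlainSite().first_callback()())  # generic parse
print(AjaxSite().first_callback()())   # site-specific first-page parse
</code></pre>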
<p>masterspider/sitespider/settings.py</p>
<pre><code>EXTENSIONS = {
    'masterspider.masterspider.MasterSpider': 500
}
</code></pre>
<p>masterspider/masterspider/masterspider.py</p>
<pre><code># -*- coding: utf8 -*-
from scrapy.spider import Spider
from scrapy.selector import Selector
from scrapy.http import Request
from sitespider.items import genspiderItem

class MasterSpider(Spider):

    def start_requests(self):
        if hasattr(self, 'parse_start'):  # First page requiring a specific parser
            fcallback = self.parse_start
        else:
            fcallback = self.parse
        return [ Request(self.spd['start_url'],
                         callback=fcallback,
                         meta={'itemfields': {}}) ]

    def parse(self, response):
        sel = Selector(response)
        lines = sel.xpath(self.spd['xlines'])
        # ...
        for line in lines:
            item = genspiderItem(response.meta['itemfields'])
            # ...
            # Get request_url of the detail page and scrape basic item info
            # ...
            yield Request(request_url,
                          callback=self.parse_item,
                          meta={'item': item, 'itemfields': response.meta['itemfields']})

        for next_url in sel.xpath(self.spd['xnext_url']).extract():
            if hasattr(self, 'next_url_parser'):  # Need to process the next page URL before?
                yield self.next_url_parser(next_url, response)
            else:
                yield Request(next_url,
                              callback=self.parse,
                              meta=response.meta)

    def parse_item(self, response):
        sel = Selector(response)
        item = response.meta['item']
        for itemname, xitemname in self.spd['x_ondetailpage'].iteritems():
            item[itemname] = "\n".join(sel.xpath(xitemname).extract())
        return item
</code></pre>
<p>masterspider/sitespider/spiders/somesite_spider.py</p>
<pre><code># -*- coding: utf8 -*-
from scrapy.spider import Spider
from scrapy.selector import Selector
from scrapy.http import Request
from sitespider.items import genspiderItem
from masterspider.masterspider import MasterSpider

class targetsiteSpider(MasterSpider):
    name = "targetsite"
    allowed_domains = ["www.targetsite.com"]
    spd = {
        'start_url' : "http://www.targetsite.com/startpage",  # Start page
        'xlines' : "//td[something...]",
        'xnext_url' : "//a[contains(@href,'something?page=')]/@href",  # Next pages
        'x_ondetailpage' : {
            "itemprop123" : u"id('someid')//text()"
        }
    }

#    def next_url_parser(self, next_url, response):  # OPTIONAL next_url regexp pre-processor
#    ...
</code></pre>
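<p>One common job for the optional next_url_parser hook is turning a relative "next page" href into an absolute URL before requesting it. A hypothetical sketch of that pre-processing step (the helper name is mine; this uses Python 3's <code>urllib.parse.urljoin</code>, whose Python 2 equivalent is <code>urlparse.urljoin</code>):</p>
<pre><code>from urllib.parse import urljoin

def resolve_next_url(base_url, next_url):
    """Resolve a possibly relative 'next page' href against the page URL -
    the kind of pre-processing a next_url_parser hook would do."""
    return urljoin(base_url, next_url)

# In a real spider the hook would wrap this in a Request, e.g.:
# def next_url_parser(self, next_url, response):
#     return Request(resolve_next_url(response.url, next_url),
#                    callback=self.parse,
#                    meta=response.meta)
</code></pre>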