I'm new to Python and Scrapy (a website-crawling framework written in Python) and hope someone can shed some light on my problem. I have just written a spider composed of two parsing functions:
- the first parses the start page I am crawling; it contains 7 levels of chapters and sub-chapters, some of which (at various levels) point to an article or a list of articles;
- the second parses an article or a list of articles, and is invoked as the callback of a scrapy.Request(...).
The goal of this spider is to build one big DOM holding the whole content: chapters, sub-chapters, articles and their text.
I am running into a problem in the second function: it sometimes receives a response whose content does not match the URL it was requested with in scrapy.Request. The problem disappears when CONCURRENT_REQUESTS is set to 1. I first suspected a multithreading / non-reentrant-function issue, but found no reentrancy problem in my code, and later read that Scrapy is in fact not multithreaded (it is single-threaded, driven by an event loop), so I don't know where my problem comes from.
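To make the shared-state question concrete, here is a minimal, Scrapy-free sketch (all names are illustrative stand-ins, not real Scrapy API; `curChild` mirrors the global in the code below). It contrasts reading a module-level global inside a callback with capturing its value when the request is created: callbacks all run on one thread, but with concurrency their completion order need not match the order the requests were made in.

```python
# Minimal sketch (no Scrapy involved): callbacks run one at a time on a
# single thread, but complete in an arbitrary order. 'curChild' mirrors
# the module-level global in the spider; the rest is illustrative.

curChild = None  # module-level mutable state, as in the spider

def make_callbacks():
    global curChild
    callbacks = []
    for chapter in ("chap-A", "chap-B"):
        curChild = chapter
        captured = curChild  # value captured at "request" creation time
        # Each callback returns (captured value, global read at call time)
        callbacks.append(lambda c=captured: (c, curChild))
    return callbacks

callbacks = make_callbacks()
# Simulate out-of-order completion, as with CONCURRENT_REQUESTS > 1:
results = [cb() for cb in reversed(callbacks)]
print(results)  # [('chap-B', 'chap-B'), ('chap-A', 'chap-B')]
```

The captured value stays correctly associated with its request whatever the completion order, while the global read at callback time reflects whatever the last iteration left there. Passing `thisChild` through the request's `meta`, as my code below does, corresponds to the "captured" case; any code path that read `curChild` inside the second callback would be the unsafe one.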
Here is a snippet of my code:
#---------------------------------------------
# Init part:
#---------------------------------------------
import scrapy
from scrapy import signals
from xml.etree.ElementTree import Element, SubElement, Comment, tostring
from scrapy.exceptions import CloseSpider
top = Element('top')
curChild = top
class mytest(scrapy.Spider):
name = 'lfb'
#
# This is what make my code working but I don't know why !!!
# Ideally would like to benefit from the speed of having several concurrent
# requests when crawling & parsing
#
custom_settings = {
'CONCURRENT_REQUESTS': 1,
}
#
# This section is just here to be able to do something when the spider closes
# In this case I want to print the DOM I've created.
@classmethod
def from_crawler(cls, crawler, *args, **kwargs):
spider = super(mytest, cls).from_crawler(crawler, *args, **kwargs)
crawler.signals.connect(spider.spider_closed, signal=signals.spider_closed)
return spider
def spider_closed(self, spider):
print ("Spider closed - !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!")
#this is to print the DOM created at the end
print tostring(top)
def start_requests(self):
level = 0
print "Start parsing legifrance level set to %d" % level
# This is to print the DOM which is empty (or almost - just the top element in there)
print tostring(top)
yield scrapy.Request("<Home Page>", callback=self.parse)
#----------------------------------------------
# First parsing function - Parsing the Home page - this one works fine (I think)
#----------------------------------------------
def parse(self, response):
for sel in response.xpath('//span'):
cl = sel.xpath("@class").extract()
desc = sel.xpath('text()').extract()
#
# Do some stuff here depending on the class (cl) of 'span' which corresponds
# to either one of the # 7 levels of chapters & sub-chapters or to list of
# articles attached to a sub-chapters. To simplify I'm just putting here the
# code corresponding to the handling of list of articles (cl == codeLienArt)
# ...
# ...
if cl == [unicode('codeLienArt')]:
art_plink= sel.css('a::attr("href")').extract()
artLink= "<Base URL>"+str(unicode(art_plink[0]))
#
# curChild points to the element in the DOM to which the list of articles
# should be attached. Pass it in the request meta, in order for the second
# parsing function to place the articles & their content at the right place
# in the DOM
#
thisChild = curChild
#
# print for debug - thisChild.text contains the heading of the sub-chapter
# to which the list of articles that will be processed by parse1 should be
# attached.
#
print "follow link cl:%s art:%s for %s" % (cl, sel.xpath('a/text()').extract(), thisChild.text )
#
# get the list of articles following artLink & pass the response to the second parsing function
# (I know it's called parse1 :-)
#
yield scrapy.Request(artLink, callback=self.parse1, meta={ 'element': thisChild })
#-------------------
# This is the second parsing function that parses list of Articles & their content
# format is basically one or several articles, each being presented(simplified) as
# < div class="Articles">
# <div class="titreArt"> Title here</div>
# <div class="corpsArt"> Sometime some text and often a list of paragraph <p>sentences</p>" ></div>
# </div>
#-------------------
def parse1(self, resp):
print "enter parse1"
numberOfArticles= 0
for selArt in resp.xpath('//div[@class="article"]'):
#
# This is where I see the problem when CONCURRENT_REQUESTS > 1, sometimes
# the response points to a page that is not the page that was requested in
# the previous parsing function...
#
clArt = selArt.xpath('.//div[@class="titreArt"]/text()').extract()
print clArt
numberOfArticles += 1
childArt = SubElement(resp.meta['element'], 'Article')
childArt.text =str(unicode("%s" % clArt[0]))
corpsArt = selArt.xpath('.//div[@class="corpsArt"]/text()').extract()
print "corpsArt=%s" % corpsArt
temp = ''
for corpsItem in corpsArt:
if corpsItem != '\n':
temp += corpsItem
if temp != '':
childCorps = SubElement(childArt, 'p')
childCorps.text = temp
print "corpsArt is not empty %s" % temp
for paraArt in selArt.xpath('.//div[@class="corpsArt"]//p/text()').extract():
childPara = SubElement(childArt, 'p')
childPara.text = paraArt
print "childPara.text=%s" % childPara.text
print "link followed %s (%d)" % (resp.url,numberOfArticles)
print "leave parse1"
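For what it's worth, attaching each article to the parent element carried in `meta` should be order-safe in itself. A small ElementTree sketch (illustrative names only, no Scrapy) showing that articles land under the parent captured for them even when "responses" complete out of order; only the sibling order can vary:

```python
# Sketch: attaching each article to the parent element captured at
# request time (the 'meta' pattern) places it under the right
# sub-chapter regardless of completion order.
from xml.etree.ElementTree import Element, SubElement, tostring

top = Element('top')
chapters = {name: SubElement(top, 'chapter', {'name': name})
            for name in ('A', 'B')}

# Pretend these are (parent-carried-in-meta, article-title) pairs,
# completing in reverse order of the requests:
pending = [(chapters['A'], 'art-1'), (chapters['B'], 'art-2')]
for parent, title in reversed(pending):
    art = SubElement(parent, 'Article')
    art.text = title

print(tostring(top))
```

So if the tree comes out wrong, the association must already be wrong in the response/meta pair that parse1 receives, which is exactly what I seem to observe with CONCURRENT_REQUESTS > 1.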