Problem with Scrapy when CONCURRENT_REQUESTS > 1

Published 2024-06-26 02:27:32


I'm new to Python and to Scrapy (the web-crawling framework written in Python) and hope someone can shed some light on my problem. I have just written a spider made of two parse functions:

- the first parses the start page I'm crawling; it contains 7 levels of chapters and sub-chapters, some of which sit at different levels and point to an article or to a list of articles;
- the second parses an article or a list of articles, and is invoked as the callback of a scrapy.Request(...).

The goal of this spider is to build one big DOM holding the entire content: chapters, sub-chapters, articles and their text.
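
To make the goal concrete, here is a minimal sketch (not the spider itself) of the tree shape I'm after, built with the same ElementTree calls as the code below; the 'chapter' tag name is just illustrative:

#---------------------------------------------
# Sketch only: the kind of tree the spider should build
#---------------------------------------------
from xml.etree.ElementTree import Element, SubElement, tostring

top = Element('top')
chapter = SubElement(top, 'chapter')      # one of the 7 chapter/sub-chapter levels
chapter.text = 'Sub-chapter heading'
article = SubElement(chapter, 'Article')  # added by the second parsing function
article.text = 'Article title'
para = SubElement(article, 'p')
para.text = 'Article body text'

print tostring(top)
# <top><chapter>Sub-chapter heading<Article>Article title<p>Article body text</p></Article></chapter></top>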

The problem I'm hitting is in the second function: it sometimes seems to receive a response whose content does not match the URL that was passed to scrapy.Request when the request was made. The problem disappears when CONCURRENT_REQUESTS is set to 1. I first thought it was due to some multithreading / non-reentrant-function problem, but found I had no reentrancy issue, and later read that Scrapy is actually not multithreaded... so I don't know where my problem comes from.
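
To make the symptom checkable, something like the following standalone sketch (spider name, start URL and link selector are placeholders, not my real code) compares the URL a request was created for with what the callback actually receives; note that response.url can legitimately differ from the requested URL after a redirect, while response.request.url is the URL of the request that produced the response:

#---------------------------------------------
# Sketch only: checking requested URL vs received response
#---------------------------------------------
import scrapy

class UrlCheckSpider(scrapy.Spider):
    name = 'urlcheck'
    start_urls = ['<Home Page>']

    def parse(self, response):
        for href in response.css('a::attr("href")').extract():
            url = response.urljoin(href)
            # remember which URL this request was created for
            yield scrapy.Request(url, callback=self.parse1,
                                 meta={'requested_url': url})

    def parse1(self, resp):
        if resp.url != resp.meta['requested_url']:
            self.logger.warning("asked for %s but got %s (request.url=%s)",
                                resp.meta['requested_url'], resp.url,
                                resp.request.url)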

Here is a snippet of my code:

#---------------------------------------------
# Init part:
#---------------------------------------------
import scrapy
from scrapy import signals
from xml.etree.ElementTree import Element, SubElement, Comment, tostring
from scrapy.exceptions import CloseSpider

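#
# Module-level tree: 'top' is the root of the DOM being built, and
# 'curChild' points at the element to which newly parsed content should be
# attached. This state lives at module level, so it is shared by all of the
# spider's callbacks.
#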
top = Element('top')
curChild = top

class mytest(scrapy.Spider):
    name = 'lfb'

#
# This is what makes my code work, but I don't know why!
# Ideally I would like to benefit from the speed of having several
# concurrent requests when crawling & parsing.
#
    custom_settings = {
        'CONCURRENT_REQUESTS': 1,
    }

#
# This section is just here to be able to do something when the spider closes
# In this case I want to print the DOM I've created.
    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        spider = super(mytest, cls).from_crawler(crawler, *args, **kwargs)
        crawler.signals.connect(spider.spider_closed, signal=signals.spider_closed)
        return spider

    def spider_closed(self, spider):
        print "Spider closed - !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!"
        # this is to print the DOM created at the end
        print tostring(top)


    def start_requests(self):
        level = 0
        print "Start parsing legifrance level set to %d" % level
# This is to print the DOM which is empty (or almost - just the top element in there)
        print tostring(top)
        yield scrapy.Request("<Home Page>", callback=self.parse)

#----------------------------------------------
# First parsing function - Parsing the Home page - this one works fine (I think)
#----------------------------------------------
    def parse(self, response):
        for sel in response.xpath('//span'):
            cl = sel.xpath("@class").extract()
            desc = sel.xpath('text()').extract()
#
# Do some stuff here depending on the class (cl) of 'span' which corresponds 
# to either one of the # 7 levels of chapters & sub-chapters or to list of
# articles attached to a sub-chapters. To simplify I'm just putting here the 
# code corresponding to the handling of list of articles (cl == codeLienArt)
#           ...
#           ...
            if cl == [u'codeLienArt']:
                art_plink = sel.css('a::attr("href")').extract()
                artLink = "<Base URL>" + art_plink[0]   # extract() already returns unicode strings
#
# curChild points to the element in the DOM to which the list of articles
# should be attached. Pass it in the request meta, so that the second
# parsing function can place the articles & their content at the right
# place in the DOM.
#
                thisChild = curChild
#
# print for debug - thisChild.text contains the heading of the sub-chapter
# to which the list of articles that will be processed by parse1 should be
# attached.
#
                print "follow link cl:%s art:%s for %s" % (cl, sel.xpath('a/text()').extract(), thisChild.text)
#
# get the list of articles following artLink & pass the response to the
# second parsing function (I know it's called parse1 :-)
#
                yield scrapy.Request(artLink, callback=self.parse1, meta={'element': thisChild})

#-------------------
# This is the second parsing function; it parses a list of articles & their
# content. The format is basically one or several articles, each presented
# (simplified) as:
# <div class="Articles">
#   <div class="titreArt"> Title here </div>
#   <div class="corpsArt"> sometimes some text, and often a list of paragraphs <p>sentences</p> </div>
# </div>
#-------------------
    def parse1(self, resp):
        print "enter parse1"
        numberOfArticles = 0
        for selArt in resp.xpath('//div[@class="article"]'):
#
# This is where I see the problem when CONCURRENT_REQUESTS > 1: sometimes
# the response points to a page that is not the page that was requested in
# the previous parsing function...
#
            clArt = selArt.xpath('.//div[@class="titreArt"]/text()').extract()
            print clArt
            numberOfArticles += 1
            childArt = SubElement(resp.meta['element'], 'Article')
            childArt.text = clArt[0]   # extract() already returns unicode
            corpsArt = selArt.xpath('.//div[@class="corpsArt"]/text()').extract()
            print "corpsArt=%s" % corpsArt
            temp = ''
            for corpsItem in corpsArt:
                if corpsItem != '\n':
                    temp += corpsItem

            if temp != '':
                childCorps = SubElement(childArt, 'p')
                childCorps.text = temp
                print "corpsArt is not empty %s" % temp
            for paraArt in selArt.xpath('.//div[@class="corpsArt"]//p/text()').extract():
                childPara = SubElement(childArt, 'p')
                childPara.text = paraArt
                print "childPara.text=%s" % childPara.text

        print "link followed %s (%d)" % (resp.url, numberOfArticles)
        print "leave parse1"
        # nothing to yield here: parse1 only adds elements to the tree
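
For reference, I run the spider with the standard Scrapy CLI; as far as I understand, the -s option overrides custom_settings, so the behaviour can be toggled from the command line:

scrapy crawl lfb -s CONCURRENT_REQUESTS=8    # the mismatch shows up
scrapy crawl lfb -s CONCURRENT_REQUESTS=1    # works as expected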
