Cannot write to a global variable from a Python Scrapy method

Posted 2024-05-03 06:52:32


I have defined a Scrapy method that I want to write to a global variable.

I set up a global variable with a placeholder value:

currentTitle = 'unchanged global title'

Then I defined the Scrapy spider:

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    currentClassTitle = 'unchanged class title'

    def start_requests(self):
        urls = [
            #here goes my list of urls to scrape
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        title=response.xpath('/html/head/title').getall()
        title=str(title)
        title=title[9:-10]
        print(title)
        #So far so good, the title is correctly extracted and printed
        #I intend to write the title to both global currentTitle variable
        # and to class variable currentClassTitle, using method update for the latter:

        global currentTitle
        currentTitle = title
        QuotesSpider.update(title)

    @staticmethod
    def update(value):
        # Called as QuotesSpider.update(title), so it takes no self argument
        QuotesSpider.currentClassTitle = value

Next comes the standard Scrapy boilerplate, which I am not very familiar with, but which worked fine until I ran into this problem:

def crawl():
    process = CrawlerProcess({
        'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'
    })

    process.crawl(QuotesSpider)
    process.start() # the script will block here until the crawling is finished
    time.sleep(2)

def run_spider(spider):
    def f(q):
        try:
            runner = crawler.CrawlerRunner()
            deferred = runner.crawl(spider)
            deferred.addBoth(lambda _: reactor.stop())
            reactor.run()
            q.put(None)
        except Exception as e:
            q.put(e)

    q = Queue()
    p = Process(target=f, args=(q,))
    p.start()
    result = q.get()
    p.join()

    if result is not None:
        raise result
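
A note on run_spider, assuming Process and Queue here come from Python's multiprocessing module (the imports are not shown): the crawl then runs in a child process, and anything the spider assigns to a global or class attribute changes only the child's copy. The following minimal sketch uses no Scrapy at all; set_title and run_in_child are hypothetical names that only mimic that structure.

```python
import multiprocessing

# Module-level global, mirroring currentTitle in the question.
current_title = "unchanged global title"


def set_title():
    """Runs in the child process and rebinds the child's copy of the global."""
    global current_title
    current_title = "title set in child"
    print("child sees:", current_title)


def run_in_child():
    """Start set_title in a separate process and wait for it to finish."""
    p = multiprocessing.Process(target=set_title)
    p.start()
    p.join()


if __name__ == "__main__":
    run_in_child()
    # The assignment happened in the child's memory, not in this process.
    print("parent sees:", current_title)
```

If the same thing happens in run_spider, no amount of sleeping in the parent would make the new value appear there.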

Below is a function that triggers the scraper whenever any document in a particular Google Firestore collection has the value crawled: False (which it then changes to True):

def on_snapshot(col_snapshot, changes, read_time):
    docCounter = 0
    for doc in col_snapshot:
        print(u'{}'.format(doc.id))
        thisDoc_ref = db.collection(u'urls').document(doc.id)
        thisDoc_ref.update({u'capital': 'sample capital name'})
        thisDoc_ref.update({u'crawled': True})
        run_spider(QuotesSpider)
        sleep(5)
        #Just to ensure that I give the crawler enough time to process, the function sleeps after triggering the spider. 
        #Not the best practice, but good enough for testing functionality for the moment

        print(QuotesSpider.currentClassTitle)
        #I get 'unchanged class title'

        print(currentTitle)
        #I get 'unchanged global title'

        thisDoc_ref.update({u'title': currentTitle})
        #I intend to store the document's title in Firestore using the value of currentTitle, which will not work
        #because I cannot retrieve the value of title

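Either way, getting the title from the class attribute QuotesSpider.currentClassTitle or from the global variable currentTitle would work for me, but neither does. When the spider runs, I seem unable to update the value of either one.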

