I defined a Scrapy spider method that I want to write to a global variable.
I set up a global variable with a placeholder value:
currentTitle = 'unchanged global title'
Then I defined the Scrapy spider:
import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    currentClassTitle = 'unchanged class title'

    def start_requests(self):
        urls = [
            #here goes my list of urls to scrape
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        title = response.xpath('/html/head/title').getall()
        title = str(title)
        title = title[9:-10]
        print(title)
        #So far so good, the title is correctly extracted and printed
        #I intend to write the title to both global currentTitle variable
        # and to class variable currentClassTitle, using method update for the latter:
        global currentTitle
        currentTitle = title
        QuotesSpider.update(title)

    @staticmethod
    def update(value):
        QuotesSpider.currentClassTitle = value
Next comes some standard Scrapy boilerplate that I am not very familiar with, but which worked fine until I ran into this problem:
import time
from multiprocessing import Process, Queue

from scrapy import crawler
from scrapy.crawler import CrawlerProcess
from twisted.internet import reactor

def crawl():
    process = CrawlerProcess({
        'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'
    })
    process.crawl(QuotesSpider)
    process.start() # the script will block here until the crawling is finished
    time.sleep(2)

def run_spider(spider):
    def f(q):
        try:
            runner = crawler.CrawlerRunner()
            deferred = runner.crawl(spider)
            deferred.addBoth(lambda _: reactor.stop())
            reactor.run()
            q.put(None)
        except Exception as e:
            q.put(e)

    q = Queue()
    p = Process(target=f, args=(q,))
    p.start()
    result = q.get()
    p.join()
    if result is not None:
        raise result
Below is a function that triggers the scraper whenever any document in a particular Google Firestore collection has crawled: False (which it then changes to True):
def on_snapshot(col_snapshot, changes, read_time):
    docCounter = 0
    for doc in col_snapshot:
        print(u'{}'.format(doc.id))
        thisDoc_ref = db.collection(u'urls').document(doc.id)
        thisDoc_ref.update({u'capital': 'sample capital name'})
        thisDoc_ref.update({u'crawled': True})
        run_spider(QuotesSpider)
        sleep(5)
        #Just to ensure that I give the crawler enough time to process, the function sleeps after triggering the spider.
        #Not the best practice, but good enough for testing functionality for the moment
        print(QuotesSpider.currentClassTitle)
        #I get 'unchanged class title'
        print(currentTitle)
        #I get 'unchanged global title'
        thisDoc_ref.update({u'title': currentTitle})
        #I intend to store the document's title in Firestore using the value of currentTitle, which will not work
        #because I cannot retrieve the value of title
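Either way, reading the title from the class attribute QuotesSpider.currentClassTitle or from the global variable currentTitle would work for me, but neither does. I can't seem to update the value of either one when the spider runs.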
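One possible explanation, offered as a sketch rather than a confirmed diagnosis: run_spider starts the crawl in a separate multiprocessing.Process, and a child process works on its own copy of module globals and class attributes, so assignments made there are not visible in the parent. The example below reproduces that behaviour without Scrapy and shows the value being handed back through the Queue instead. The names current_title, child_sets_global and run_in_child are hypothetical, and the code assumes a platform where the fork start method is available (the nested target function in run_spider needs that as well).

```python
import multiprocessing as mp

current_title = "unchanged global title"  # stands in for currentTitle


def child_sets_global(q):
    """Runs in the child process: rebinds the global and reports it via the queue."""
    global current_title
    current_title = "title set in child"
    q.put(current_title)  # explicit hand-off back to the parent


def run_in_child():
    """Start the child, wait for it, and return (queue value, parent's global)."""
    ctx = mp.get_context("fork")  # assumption: POSIX platform with fork available
    q = ctx.Queue()
    p = ctx.Process(target=child_sets_global, args=(q,))
    p.start()
    from_queue = q.get()  # read before join so the child can flush the queue
    p.join()
    return from_queue, current_title


if __name__ == "__main__":
    from_queue, parent_global = run_in_child()
    print(from_queue)     # value the child sent through the queue
    print(parent_global)  # the parent's own global, which the child never touched
```

If this is indeed the cause, having f put the scraped title on q (instead of None) and reading it in run_spider would be one way to get the title back into on_snapshot.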
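If you want to access a variable that is specific to an instance of a Python class from within one of the class's methods, you need to use self.
Use
QuotesSpider.currentClassTitle
to access the default class variable (when your class has not been instantiated). More information here: https://www.geeksforgeeks.org/self-in-python-class/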
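The difference between an instance attribute and a class attribute can be illustrated with a small sketch. DemoSpider and its two setter methods are hypothetical names used only for this example and are not part of Scrapy.

```python
class DemoSpider:
    current_class_title = "unchanged class title"  # class attribute, shared by default

    def set_on_instance(self, value):
        # Assigning through self creates an attribute on this instance only.
        self.current_class_title = value

    @classmethod
    def set_on_class(cls, value):
        # Assigning through cls rebinds the attribute on the class itself.
        cls.current_class_title = value


a = DemoSpider()
b = DemoSpider()

a.set_on_instance("instance title")
# a now shadows the class attribute; b and the class still see the default.
after_instance = (a.current_class_title, b.current_class_title,
                  DemoSpider.current_class_title)

DemoSpider.set_on_class("class title")
# b follows the class; a keeps its own instance attribute.
after_class = (a.current_class_title, b.current_class_title,
               DemoSpider.current_class_title)

print(after_instance)
print(after_class)
```

Note that this only covers attribute lookup within a single process; it does not by itself make values set in another process visible.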