Unable to scrape anything when running a Scrapy spider

import scrapy
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging
from scrapy_splash import SplashRequest
from twisted.internet import defer, reactor

# `script` (the Lua source sent to Splash) and `APAitem` are defined
# elsewhere; they are not shown in the question.


class APASpider(scrapy.Spider):
    name = 'APA_test'
    allowed_domains = ['some_domain.com']
    start_urls = ['startin_url']

    def start_requests(self):
        for url in self.start_urls:
            yield SplashRequest(
                url, self.parse,
                endpoint='execute',
                cache_args=['lua_source'],
                args={'lua_source': script, 'timeout': 3600},
                headers={'X-My-Header': 'value'},
            )

    def parse(self, response):
        # Follow every product link on the listing page.
        for href in response.xpath('//a[@class="product-link"]/@href').extract():
            yield SplashRequest(
                response.urljoin(href), self.parse_produits,
                endpoint='execute',
                cache_args=['lua_source'],
                args={'lua_source': script, 'timeout': 3600},
                headers={'X-My-Header': 'value'},
            )

        # Follow the "load more" link to the next listing page.
        for pages in response.xpath('//*[@id="loadmore"]/@href'):
            yield SplashRequest(
                response.urljoin(pages.extract()), self.parse,
                endpoint='execute',
                cache_args=['lua_source'],
                args={'lua_source': script, 'timeout': 3600},
                headers={'X-My-Header': 'value'},
            )

    def parse_produits(self, response):
        Nom = response.xpath("//h1/text()").extract()
        Poids = response.xpath('//p[@class="description"]/text()').extract()
        item_APA = APAitem()
        item_APA["Titre"] = Nom
        item_APA["Poids"] = Poids
        yield item_APA


configure_logging()
runner = CrawlerRunner()


@defer.inlineCallbacks
def crawl():
    yield runner.crawl(APASpider)
    reactor.stop()


crawl()
reactor.run()  # the script will block here until the last crawl call is finished

1 answer

User

#1 · Posted 2024-06-26 01:37:28

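Since the question does not include any log messages, it is hard to say exactly what is going wrong.

Still, I will try to answer, because I ran into the same problem before.

There is a reported scrapy_splash issue concerning the line local last_response = entries[#entries].response in the Splash script. I suspect your script contains that line too, as mine did.

The workaround I used is to check whether the history is empty before taking the last entry (as suggested by GitHub user kmike).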
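The workaround can be sketched roughly as follows. The question does not show its Lua script, so this assumes it resembles the usual scrapy_splash example that reads the last entry of splash:history(); the Lua body, the 0.5 second wait, and the variable name splash_args are illustrative assumptions, not the asker's actual code. The only real change is the #entries > 0 guard around the history lookup.

```python
# Assumed Lua source for the Splash 'execute' endpoint. The asker's real
# script is not shown in the question, so this is only an illustration.
script = """
function main(splash)
  splash:init_cookies(splash.args.cookies)
  assert(splash:go{
    splash.args.url,
    headers=splash.args.headers,
    http_method=splash.args.http_method,
    body=splash.args.body,
  })
  assert(splash:wait(0.5))

  local result = {
    url = splash:url(),
    cookies = splash:get_cookies(),
    html = splash:html(),
  }

  -- Workaround: the history can be empty, so only index the last
  -- entry when at least one entry exists.
  local entries = splash:history()
  if #entries > 0 then
    local last_response = entries[#entries].response
    result.headers = last_response.headers
    result.http_status = last_response.status
  end

  return result
end
"""

# Same args dict the spider in the question passes to SplashRequest(args=...).
splash_args = {'lua_source': script, 'timeout': 3600}
```

With the guard in place, a page that produces no history entries returns a result without headers and http_status instead of raising a Lua error on a nil entry.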
