<p>There are a few things you can change in your code:</p>
<ol>
<li>You don't need to create/import a Selector: the response object has .css() and .xpath() methods, which are shortcuts for selectors. <a href="https://doc.scrapy.org/en/latest/topics/request-response.html#scrapy.http.TextResponse.xpath" rel="nofollow noreferrer">Docs</a></li>
<li>HtmlXPathSelector is deprecated; you should use the selector's (or rather the response's) .xpath() method instead.</li>
<li>.extract() produces a list of URLs, so you can't call Request on that list; you should use .extract_first() here instead.</li>
</ol>
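<p>The difference between .extract() and .extract_first() from the points above can be shown with a standalone snippet. It uses <code>parsel</code>, the selector library behind Scrapy's <code>response.css()</code>/<code>response.xpath()</code> (installed as a Scrapy dependency); the HTML here is just a made-up example:</p>
<pre><code>from parsel import Selector

# hypothetical page fragment, just to demonstrate the API
html = '&lt;div&gt;&lt;a href="/book/1"&gt;A&lt;/a&gt;&lt;a href="/book/2"&gt;B&lt;/a&gt;&lt;/div&gt;'
sel = Selector(text=html)

# .extract() returns a list of *all* matches
urls = sel.css('a::attr(href)').extract()
print(urls)          # ['/book/1', '/book/2']

# .extract_first() returns the first match only, which is what you
# want before passing a single URL to Request()
first = sel.css('a::attr(href)').extract_first()
print(first)         # '/book/1'

# no match -&gt; .extract_first() returns None instead of raising
missing = sel.css('a.next::attr(href)').extract_first()
print(missing)       # None
</code></pre>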
<p>Applying these points:</p>
<pre><code># -*- coding: utf-8 -*-
from scrapy.contrib.spiders import CrawlSpider
from scrapy.http import Request


class YourCrawler(CrawlSpider):
    name = "***"
    start_urls = [
        'http://www.***.com/10000000000177/',
    ]
    # allowed_domains takes domain names, not full URLs
    allowed_domains = ["www.***.com"]

    def parse(self, response):
        # response.css() is a shortcut; no Selector needs to be created
        page_list_urls = response.css('#content-center > div.listado_libros.gwe_libros > div > form > dl.dublincore > dd.title > a::attr(href)').extract()
        for url in page_list_urls:
            yield Request(response.urljoin(url), callback=self.parse_following_urls, dont_filter=True)

        # extract_first() returns a single URL (or None), not a list
        next_page = response.xpath(u"//*[@id='content-center']/div[@class='paginador']/a[text()='\u00bb']/@href").extract_first()
        if next_page is not None:
            next_page = response.urljoin(next_page)
            yield Request(next_page, callback=self.parse)

    def parse_following_urls(self, response):
        for each_book in response.css('div#container'):
            yield {
                'title': each_book.css('div#content > div#primary > div > h1.title-book::text').extract(),
            }
</code></pre>
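<p>A note on <code>response.urljoin(url)</code> used above: per the Scrapy docs it is a shortcut for <code>urljoin(response.url, url)</code>, so the relative hrefs you extract are resolved against the page they came from. You can check the resolution rules without running a spider (the base URL below is just a placeholder standing in for <code>response.url</code>):</p>
<pre><code>from urllib.parse import urljoin

# hypothetical response.url, mirroring the start_urls pattern above
base = 'http://www.example.com/10000000000177/'

# an absolute path replaces the whole path of the base URL
print(urljoin(base, '/book/42'))   # http://www.example.com/book/42

# a query-only href keeps the base path and swaps the query string
print(urljoin(base, '?page=2'))    # http://www.example.com/10000000000177/?page=2
</code></pre>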