Scrapy and XPath: finding an <a> by its text "»"

<div id="content-center"> <div class="paginador"> <span class="current">01</span> <a href="ml=0">02</a> <a href="ml=0">03</a> <a href="ml=0">04</a> <a href="ml=0">»</a> <a href="ml=0">Last</a> </div> </div>

3 answers

User

Answer 1 · edited 2024-09-30 06:26:45

I think BeautifulSoup can do the job:

data = '''
<div class="pages">
  <span class="current">01</span>
  <a href="ml=0">02</a>
  <a href="ml=0">03</a>
  <a href="ml=0">04</a>
  <a href="ml=0">05</a>
  <a href="ml=0">06</a>
  <a href="ml=0">07</a>
  <a href="ml=0">08</a>
  <a href="ml=0">09</a>
  <a href="ml=0">10</a>
  <a href="ml=0">»</a>
  <a href="ml=0">Last</a>
</div>
'''

from bs4 import BeautifulSoup

bsobj = BeautifulSoup(data, 'html.parser')
# Print the href of every <a> whose text is exactly "»".
for a in bsobj.find_all('a'):
    if a.text == '»':
        print(a['href'])

User

Answer 2 · edited 2024-09-30 06:26:45

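There are a few things you could change in your code:

You don't need to create or import a Selector: the response object has .css() and .xpath() methods, which are shortcuts to its selector (see the Scrapy docs).
HtmlXPathSelector is deprecated; use the .xpath() method of Selector (or, better, of the response) instead.
.extract() returns a list of URLs, so you can't pass its result straight to Request; use extract_first() here instead.

Applying these points: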

# -*- coding: utf-8 -*-
from scrapy.http import Request
from scrapy.spiders import Spider


# No crawl rules are defined, so a plain Spider is enough here.
class YourCrawler(Spider):
    name = "***"
    start_urls = [
        'http://www.***.com/10000000000177/',
    ]
    # allowed_domains takes bare domain names, not URLs.
    allowed_domains = ["www.***.com"]

    def parse(self, response):
        page_list_urls = response.css('#content-center > div.listado_libros.gwe_libros > div > form > dl.dublincore > dd.title > a::attr(href)').extract()
        for url in page_list_urls:
            yield Request(response.urljoin(url), callback=self.parse_following_urls, dont_filter=True)
        next_page = response.xpath(u"//*[@id='content-center']/div[@class='paginador']/a[text()='\u00bb']/@href").extract_first()
        if next_page is not None:
            next_page = response.urljoin(next_page)
            yield Request(next_page, callback=self.parse)

    def parse_following_urls(self, response):
        for each_book in response.css('div#container'):
            yield {
                'title': each_book.css('div#content > div#primary > div > h1.title-book::text').extract(),
            }

User

Answer 3 · edited 2024-09-30 06:26:45

Try using the \u-escaped version of »:

>>> print(u'\u00bb')
»

Use it like this in your .xpath() call (note the u"..." prefix on the string argument):

hxs.select(u"//a[text()='\u00bb']/@href").extract()

Your spider.py file is probably saved as UTF-8:

>>> u'\u00bb'.encode('utf-8')
'\xc2\xbb'

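So you could also use hxs.select(u"//a[text()='»']/@href").extract() (the u prefix is still needed), but then you also have to tell Python which encoding your .py file uses.

This is usually done with # -*- coding: utf-8 -*- (or an equivalent) at the top of the .py file (for example, on the first line).

You can read more about Python source code encoding declarations in PEP 263.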
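The interpreter sessions above are from Python 2. As a rough Python 3 counterpart, the sketch below checks that the escape and the literal character are the same string and shows the UTF-8 bytes. It assumes the file is saved as UTF-8, which is what Python 3 expects by default; the variable names are only illustrative.

```python
import unicodedata

escaped = '\u00bb'  # the \u escape sequence
literal = '»'       # the literal character; assumes this file is saved as UTF-8

# Both spellings produce the same one-character string.
same = escaped == literal

# The official Unicode name of the character.
name = unicodedata.name(escaped)

# In Python 3, encoding a str yields a bytes object.
encoded = escaped.encode('utf-8')

print(same)     # True
print(name)     # RIGHT-POINTING DOUBLE ANGLE QUOTATION MARK
print(encoded)  # b'\xc2\xbb'
```

Since every Python 3 string literal is already Unicode, the u prefix is optional there, and either spelling can be used inside the XPath expression.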
