XPath expression is not evaluated as expected

Posted 2024-09-29 00:18:50

Below is the HTML from which I am trying to select 2016.

<span id="titleYear">
  "("
  <a href="/year/2016/?ref_=tt_ov_inf">2016</a>
  ")"
</span>

Here is the XPath expression: //span[@id='titleYear']/a/text()

Unfortunately, for some reason this expression selects the whole element, <a href="/year/2016/?ref_=tt_ov_inf">2016</a>, rather than its text.

//span[@id='titleYear']/a/text() returns the same result as //span[@id='titleYear']/a and //span[@id='titleYear']/a[text()].

Why does text() have no effect in this case?

Is it because 2016 is not treated as a text node?

It may be worth noting that I am using Anaconda with Python 3.6.5 and Scrapy 1.5.0.

Python script

import scrapy

class IMDBcrawler(scrapy.Spider):
    name = 'imdb'
    def start_requests(self):
        pages = []
        count = 1
        limit = 10
        while (count <= limit):
            str_number = '%07d' % count
            pages.append('https://www.imdb.com/title/tt' + str_number)
            count += 1
        for url in pages:
            yield scrapy.Request(url=url, callback=self.parse)
    def parse(self, response):
        yield {
            'nom': response.xpath('//div[@class="title_wrapper"]/h1/text()').extract_first(),
            'ano': response.xpath('//span[@id="titleYear"]/a/text()').extract_first(),
        }

Output

[
  {
    "nom": "Chinese Opium Den\u00a0",
    "ano": "<a href=\"\/year\/1894\/?ref_=tt_ov_inf\">1894<\/a>"
  },
  {
    "nom": "Pauvre Pierrot\u00a0",
    "ano": "<a href=\"\/year\/1892\/?ref_=tt_ov_inf\">1892<\/a>"
  },
  {
    "nom": "Carmencita\u00a0",
    "ano": "<a href=\"\/year\/1894\/?ref_=tt_ov_inf\">1894<\/a>"
  },
  {
    "nom": "Un bon bock\u00a0",
    "ano": "<a href=\"\/year\/1892\/?ref_=tt_ov_inf\">1892<\/a>"
  },
  {
    "nom": "Blacksmith Scene\u00a0",
    "ano": "<a href=\"\/year\/1893\/?ref_=tt_ov_inf\">1893<\/a>"
  },
  {
    "nom": "Corbett and Courtney Before the Kinetograph\u00a0",
    "ano": "<a href=\"\/year\/1894\/?ref_=tt_ov_inf\">1894<\/a>"
  },
  {
    "nom": "Employees Leaving the Lumi\u00e8re Factory\u00a0",
    "ano": "<a href=\"\/year\/1895\/?ref_=tt_ov_inf\">1895<\/a>"
  },
  {
    "nom": "Miss Jerry\u00a0",
    "ano": "<a href=\"\/year\/1894\/?ref_=tt_ov_inf\">1894<\/a>"
  },
  {
    "nom": "Le clown et ses chiens\u00a0",
    "ano": "<a href=\"\/year\/1892\/?ref_=tt_ov_inf\">1892<\/a>"
  },
  {
    "nom": "Edison Kinetoscopic Record of a Sneeze\u00a0",
    "ano": "<a href=\"\/year\/1894\/?ref_=tt_ov_inf\">1894<\/a>"
  }
]

Thanks.


1 answer

Answer 1 · posted 2024-09-29 00:18:50

I'm not sure what the problem is with Scrapy, but using lxml directly, with the help of requests, a simpler XPath with findtext works fine:

import requests

from lxml import html

pages = []

for count in range(1, 10):
    str_num = '%07d' % count
    res = html.fromstring(requests.get('https://www.imdb.com/title/tt' + str_num).text)
    pages.append({'nom': res.findtext('.//div[@class="title_wrapper"]/h1'), 'ano': res.findtext('.//span[@id="titleYear"]/a')})

Result:

In [40]: pages
Out[40]:
[{'ano': '1894', 'nom': 'Carmencita\xa0'},
 {'ano': '1892', 'nom': 'Le clown et ses chiens\xa0'},
 {'ano': '1892', 'nom': 'Pauvre Pierrot\xa0'},
 {'ano': '1892', 'nom': 'Un bon bock\xa0'},
 {'ano': '1893', 'nom': 'Blacksmith Scene\xa0'},
 {'ano': '1894', 'nom': 'Chinese Opium Den\xa0'},
 {'ano': '1894', 'nom': 'Corbett and Courtney Before the Kinetograph\xa0'},
 {'ano': '1894', 'nom': 'Edison Kinetoscopic Record of a Sneeze\xa0'},
 {'ano': '1894', 'nom': 'Miss Jerry\xa0'}]
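
To see why findtext returns just the year, here is a minimal offline sketch that runs the same path against the HTML snippet from the question. It rests on a few assumptions: it uses the standard library's xml.etree.ElementTree instead of lxml (lxml provides the same find/findtext methods), it wraps the snippet in a hypothetical <div> so the relative path has something to search from, and it writes the parentheses as plain text. It does not reproduce the Scrapy behaviour described in the question.

```python
import xml.etree.ElementTree as ET

# Snippet from the question, wrapped in a hypothetical <div> (an assumption)
# so that the relative path './/span' has a parent element to search from.
SNIPPET = (
    '<div>'
    '<span id="titleYear">(<a href="/year/2016/?ref_=tt_ov_inf">2016</a>)</span>'
    '</div>'
)

root = ET.fromstring(SNIPPET)
path = './/span[@id="titleYear"]/a'

link = root.find(path)      # the <a> element itself
year = root.findtext(path)  # the text of the first matching element

print(link.tag, link.attrib['href'])  # a /year/2016/?ref_=tt_ov_inf
print(year)                           # 2016
```

find gives back the element, while findtext gives back only that element's own text, so no text() step is needed in the path.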
