Below is the HTML from which I am trying to select 2016.
<span id="titleYear">
"("
<a href="/year/2016/?ref_=tt_ov_inf">2016</a>
")"
</span>
Here is the XPath expression: //span[@id='titleYear']/a/text()
Unfortunately, for some reason this expression selects <a href="/year/2016/?ref_=tt_ov_inf">2016</a>.
//span[@id='titleYear']/a/text() returns the same result as //span[@id='titleYear']/a and //span[@id='titleYear']/a[text()].
Why does text() have no effect in this case? Is it because 2016 is not treated as a text node?
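As a sanity check on that last question, here is a minimal sketch that inspects the markup outside Scrapy. It assumes Python's standard-library xml.dom.minidom rather than Scrapy's selectors, and SNIPPET and describe_children are hypothetical names; the snippet is a re-typed copy of the HTML above without the quote marks the browser inspector adds around text. It lists the child nodes of the a element and reports whether each one is a text node.

```python
from xml.dom import Node, minidom

# Re-typed copy of the markup from the question (assumption: the quote
# marks shown by the browser inspector are not part of the real page).
SNIPPET = '<span id="titleYear">(<a href="/year/2016/?ref_=tt_ov_inf">2016</a>)</span>'


def describe_children(markup):
    """Return (is_text_node, data) for every child of the first <a> element."""
    doc = minidom.parseString(markup)
    anchor = doc.getElementsByTagName("a")[0]
    return [
        (child.nodeType == Node.TEXT_NODE, child.data)
        for child in anchor.childNodes
    ]


if __name__ == "__main__":
    print(describe_children(SNIPPET))
```

In DOM terms the a element has a single child, and that child is a text node holding 2016, so the markup itself does not explain why text() appears to be ignored.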
It is worth noting that I am using Anaconda with Python 3.6.5 and Scrapy 1.5.0.
Python script
import scrapy


class IMDBcrawler(scrapy.Spider):
    name = 'imdb'

    def start_requests(self):
        pages = []
        count = 1
        limit = 10
        while (count <= limit):
            str_number = '%07d' % count
            pages.append('https://www.imdb.com/title/tt' + str_number)
            count += 1
        for url in pages:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        yield {
            'nom': response.xpath('//div[@class="title_wrapper"]/h1/text()').extract_first(),
            'ano': response.xpath('//span[@id="titleYear"]/a/text()').extract_first(),
        }
Output
[
{
"nom": "Chinese Opium Den\u00a0",
"ano": "<a href=\"\/year\/1894\/?ref_=tt_ov_inf\">1894<\/a>"
},
{
"nom": "Pauvre Pierrot\u00a0",
"ano": "<a href=\"\/year\/1892\/?ref_=tt_ov_inf\">1892<\/a>"
},
{
"nom": "Carmencita\u00a0",
"ano": "<a href=\"\/year\/1894\/?ref_=tt_ov_inf\">1894<\/a>"
},
{
"nom": "Un bon bock\u00a0",
"ano": "<a href=\"\/year\/1892\/?ref_=tt_ov_inf\">1892<\/a>"
},
{
"nom": "Blacksmith Scene\u00a0",
"ano": "<a href=\"\/year\/1893\/?ref_=tt_ov_inf\">1893<\/a>"
},
{
"nom": "Corbett and Courtney Before the Kinetograph\u00a0",
"ano": "<a href=\"\/year\/1894\/?ref_=tt_ov_inf\">1894<\/a>"
},
{
"nom": "Employees Leaving the Lumi\u00e8re Factory\u00a0",
"ano": "<a href=\"\/year\/1895\/?ref_=tt_ov_inf\">1895<\/a>"
},
{
"nom": "Miss Jerry\u00a0",
"ano": "<a href=\"\/year\/1894\/?ref_=tt_ov_inf\">1894<\/a>"
},
{
"nom": "Le clown et ses chiens\u00a0",
"ano": "<a href=\"\/year\/1892\/?ref_=tt_ov_inf\">1892<\/a>"
},
{
"nom": "Edison Kinetoscopic Record of a Sneeze\u00a0",
"ano": "<a href=\"\/year\/1894\/?ref_=tt_ov_inf\">1894<\/a>"
}
]
Thanks.
I'm not sure what the problem with Scrapy is, but using lxml directly, with the help of requests, a simpler XPath with findtext works fine.
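The answer's own code is not reproduced here. As a rough sketch of the same idea, the example below assumes the standard library's xml.etree.ElementTree in place of lxml (lxml.etree elements offer a findtext method with the same call shape) and an inline copy of the markup instead of a page fetched with requests. PAGE and extract_year are hypothetical names used only for illustration.

```python
import xml.etree.ElementTree as ET

# Inline stand-in for the downloaded page (assumption: the real page
# would be fetched with requests and parsed with lxml instead).
PAGE = (
    '<div class="title_wrapper">'
    '<span id="titleYear">(<a href="/year/2016/?ref_=tt_ov_inf">2016</a>)</span>'
    '</div>'
)


def extract_year(markup):
    """Return the text of the <a> inside span#titleYear, or None if absent."""
    root = ET.fromstring(markup)
    # findtext returns the text of the first matching element, so no
    # trailing text() step is needed in the path.
    return root.findtext(".//span[@id='titleYear']/a")


if __name__ == "__main__":
    print(extract_year(PAGE))
```

Because findtext returns the element's text directly, the path can stop at the a element, which is what makes it simpler than the text() form used in the spider.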