Tricky issue scraping a simple website

class JournalSpider(Spider):
    name = "journal"
    allowed_domains = ["ametsoc.org"]
    start_urls = [
        "http://journals.ametsoc.org/toc/wefo/current/"
    ]

    def parse(self, response):
        journalTitle = Selector(response).xpath('//*[@id="journalBlurbPanel"]/div[2]/h3/text()').extract()[0]
        journalIssue = Selector(response).xpath('//*[@id="articleToolsHeading"]/text()').extract()[0].strip()  # remove whitespace at start and end

        # find all articles for the issue and parse each one individually
        articles = Selector(response).xpath('//div[@id="rightColumn"]//table[@class="articleEntry"]')

        for article in articles:
            item = ArticleItem()
            item['journalTitle'] = journalTitle
            item['journalIssue'] = journalIssue
            item['title'] = article.xpath('//div[@class="art_title"]/text()').extract()[0]
            item['url'] = article.xpath('//a/@href').extract()[0]
            yield item

1 Answer

User

#1 · Posted 2024-09-28 23:45:00

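The XPath expressions inside the loop must be context-specific and start with a dot. Without the dot, an expression beginning with // searches the whole document rather than the current article, so every iteration returns the first match on the page: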

item['title'] = article.xpath('.//div[@class="art_title"]/text()').extract()[0]
item['url'] = article.xpath('.//a/@href').extract()[0]

You can also use the extract_first() method instead of extract()[0], and the response.xpath() shortcut instead of Selector(response).xpath().
