Following specific links on a page

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors import LinkExtractor


class ohhlaSpider(CrawlSpider):
    name = "ohhla"
    download_delay = 0.5
    allowed_domains = ["ohhla.com"]
    start_urls = ["http://www.ohhla.com/anonymous/aesoprck/"]

    rules = (
        # trying to follow links to pages with more links to artist pages
        Rule(LinkExtractor(restrict_xpaths=('//h3/a/@href',)), follow=True),
        # trying to follow links to artist pages
        Rule(LinkExtractor(restrict_xpaths=('//pre/a/@href',)), follow=True),
        # succeeding in following links to album pages
        Rule(LinkExtractor(deny_extensions=("txt"), restrict_xpaths=('//ul/li',)), follow=True),
        # succeeding in extracting lyrics from the songs on album pages
        Rule(LinkExtractor(restrict_xpaths=('//ul/li',)), callback="extract_text", follow=False),
    )

    def extract_text(self, response):
        """ extract text from webpage"""
        string = response.xpath('//pre/text()').extract()[0]
        with open("lyrics.txt", 'wb') as f:
            f.write(string)

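2 answers

User

Answer 1 · edited 2024-09-30 08:36:38

restrict_xpaths should not point to the @href attribute. It should point to the region of the page where the link extractor will search for links: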

Rule(LinkExtractor(restrict_xpaths='//h3'), follow=True)

Note that you can specify it as a string instead of a tuple.

You can also use allow to match every link whose URL contains all*.html.
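The exact expression used in the answer is not shown here, so the following is only a sketch under an assumed pattern. LinkExtractor's allow argument takes a regular expression (or a list of them) that is searched for in each absolute URL, so a candidate pattern can be checked with the standard re module before it goes into a Rule. ALLOW_PATTERN and the all_two.html URL are assumptions made for illustration.

```python
import re

# Assumed pattern: "all", then anything, then ".html".
# It would be passed to the extractor as LinkExtractor(allow=ALLOW_PATTERN).
ALLOW_PATTERN = r"all.*\.html"

candidate_urls = [
    "http://www.ohhla.com/all.html",             # index page named in the answer
    "http://www.ohhla.com/all_two.html",         # hypothetical further index page
    "http://www.ohhla.com/anonymous/aesoprck/",  # start URL from the question
]

# re.search mirrors how the allow pattern is applied: a match anywhere in the URL counts.
matched = [url for url in candidate_urls if re.search(ALLOW_PATTERN, url)]
print(matched)
```

The two index-style URLs match and the artist directory does not, which is the behaviour wanted from a rule that should only follow the all*.html index pages.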

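You should also make sure your spider actually visits the "parent directory" pages. Starting the crawl from all.html sounds reasonable, since it is the index page of the directory: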

start_urls = ["http://www.ohhla.com/all.html"]

User

Answer 2 · edited 2024-09-30 08:36:38

The second part of this answer is useful for scraping specific links from a web page: https://stackoverflow.com/a/40146522/4418897
