Python Scrapy Spider: Inconsistent Results

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.selector import Selector
from scrapy.spiders import CrawlSpider, Rule

from acer.items import AcerItem


class AcercrawlerSpider(CrawlSpider):
    name = 'acercrawler'
    allowed_domains = ['studyacer.com']
    start_urls = ['http://www.studyacer.com/latest']

    rules = (
        Rule(LinkExtractor(), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        questions = Selector(response).xpath('//td[@class="word-break"]/a/@href').extract()
        for question in questions:
            item = AcerItem()
            item['title'] = question.xpath('//h1/text()').extract()
            item['body'] = Selector(response).xpath('//div[@class="row-fluid"][2]//p/text()').extract()
            yield item

1 Answer

User

#1 · Posted 2024-10-04 03:22:49

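Earlier today I ran into a similar but slightly different problem: my crawl spider was visiting pages I did not want. Someone answered my question and suggested checking the link extractor, as you suggested here: http://doc.scrapy.org/en/latest/topics/link-extractors.html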

class scrapy.linkextractors.lxmlhtml.LxmlLinkExtractor(allow=(), deny=(), allow_domains=(), deny_domains=(), deny_extensions=None, restrict_xpaths=(), restrict_css=(), tags=('a', 'area'), attrs=('href', ), canonicalize=True, unique=True, process_value=None)

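In the end I reviewed the allow/deny arguments to focus the crawler on a specific subset of pages. Each one takes regular expressions that match the relevant substrings of the links to allow (include) or deny (exclude). I tested my expressions with http://www.regexpal.com/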
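To make the allow/deny idea concrete, here is a minimal sketch that checks candidate URLs against a pair of pattern tuples locally, much as one would in an online regex tester. The patterns are hypothetical (I do not know the actual URL layout of studyacer.com), and `would_follow` is an illustrative helper, not part of Scrapy. It only approximates the documented behaviour: a URL is kept if it matches any allow pattern, and deny takes precedence.

```python
import re

# Hypothetical patterns: adjust them to the site's real URL structure.
ALLOW = (r"/latest", r"/question/\d+")
DENY = (r"/login", r"\?sort=")


def would_follow(url, allow=ALLOW, deny=DENY):
    """Approximate the allow/deny decision for a single absolute URL."""
    # deny wins over allow
    if any(re.search(pattern, url) for pattern in deny):
        return False
    # an empty allow tuple means "allow everything not denied"
    if not allow:
        return True
    return any(re.search(pattern, url) for pattern in allow)


if __name__ == "__main__":
    for candidate in (
        "http://www.studyacer.com/question/123",
        "http://www.studyacer.com/latest?sort=votes",
        "http://www.studyacer.com/about",
    ):
        print(candidate, would_follow(candidate))
```

Once the patterns behave as expected, the same tuples can be passed to the spider's rule as `LinkExtractor(allow=ALLOW, deny=DENY)`.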

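I found this approach sufficient to prevent duplicates. If you still see them, I also came across this post earlier today on how to prevent duplicates, although I should say I did not have to implement that fix myself: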

Avoid Duplicate URL Crawling

https://stackoverflow.com/a/21344753/6582364
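For reference, a generic way to suppress repeated results is to remember a key for every item already seen and skip anything that repeats it. The sketch below is plain Python and is not necessarily what the linked answer does; `SeenTitles` and the choice of the title as the key are assumptions for illustration. Scrapy already filters duplicate requests by default, so a check like this only matters when different URLs produce the same item.

```python
class SeenTitles:
    """Plain-Python stand-in for the state a de-duplicating item pipeline would keep."""

    def __init__(self):
        self.seen = set()

    def is_new(self, item):
        # .extract() returns a list of strings, so join them into one key
        # and normalise whitespace and case before comparing.
        key = " ".join(item.get("title", [])).strip().lower()
        if key in self.seen:
            return False
        self.seen.add(key)
        return True


if __name__ == "__main__":
    tracker = SeenTitles()
    for scraped in ({"title": ["First question"]}, {"title": ["first question "]}):
        print(scraped["title"], tracker.is_new(scraped))
```

In an actual Scrapy item pipeline, the same check would live in `process_item`, raising `scrapy.exceptions.DropItem` for repeats instead of returning `False`.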
