Navigating to the next pages listed on the first crawled page

class texashealthspider(CrawlSpider):
    name="texashealth2"
    allowed_domains=['www.texashealth.org']
    start_urls=['http://jobs.texashealth.org/search/']

    rules=(
        Rule(SgmlLinkExtractor(allow=("startrow=\d",)),callback="parse",follow=True),
    )

    def parse(self, response):
        hxs=HtmlXPathSelector(response)
        titles=hxs.select('//tbody/tr/td')
        items = []
        for titles in titles:
            item=TexashealthItem()
            item['title']=titles.select('span[@class="jobTitle"]/a/text()').extract()
            item['link']=titles.select('span[@class="jobTitle"]/a/@href').extract()
            item['shifttype']=titles.select('span[@class="jobShiftType"]/text()').extract()
            item['location']=titles.select('span[@class="jobLocation"]/text()').extract()
            items.append(item)
        print items
        return items

1 Answer

User

#1 · Posted 2024-09-29 01:28:34

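Loosen the restriction in allowed_domains=['www.texashealth.org']: change it to allowed_domains=['texashealth.org'], which also covers subdomains such as jobs.texashealth.org. Otherwise no pages will be crawled, because the links the rule extracts point to jobs.texashealth.org, which the www.texashealth.org entry does not match.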
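To see why the original setting blocks the crawl, here is a minimal sketch of the kind of host check involved. It is a simplified approximation, not Scrapy's actual code: the helper name is_allowed and the example URL with startrow=25 are made up for illustration. It assumes a host passes if it equals an allowed domain or is a subdomain of one.

```python
from urllib.parse import urlparse


def is_allowed(url, allowed_domains):
    """Return True if the URL's host equals an allowed domain or is a subdomain of it.

    Hypothetical helper: a simplified stand-in for an offsite check,
    not the implementation Scrapy uses internally.
    """
    host = (urlparse(url).hostname or "").lower()
    return any(host == d or host.endswith("." + d) for d in allowed_domains)


# Illustrative pagination link of the kind the rule's startrow pattern targets.
next_page = "http://jobs.texashealth.org/search/?startrow=25"

# jobs.texashealth.org is not www.texashealth.org, nor a subdomain of it.
print(is_allowed(next_page, ["www.texashealth.org"]))  # False

# texashealth.org covers every subdomain, including jobs.texashealth.org.
print(is_allowed(next_page, ["texashealth.org"]))  # True
```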

By the way, consider renaming the callback function, as the docs advise:

Warning
When writing crawl spider rules, avoid using parse as callback, since the CrawlSpider uses the parse method itself to implement its logic. So if you override the parse method, the crawl spider will no longer work.
