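Hi, I need help with the code below. I want it to navigate to the rest of the pages linked from the start URL and scrape data from them. Please help.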
class texashealthspider(CrawlSpider):
    name="texashealth2"
    allowed_domains=['www.texashealth.org']
    start_urls=['http://jobs.texashealth.org/search/']
    rules=(
        Rule(SgmlLinkExtractor(allow=("startrow=\d",)),callback="parse",follow=True),
    )

    def parse(self, response):
        hxs=HtmlXPathSelector(response)
        titles=hxs.select('//tbody/tr/td')
        items = []
        for titles in titles:
            item=TexashealthItem()
            item['title']=titles.select('span[@class="jobTitle"]/a/text()').extract()
            item['link']=titles.select('span[@class="jobTitle"]/a/@href').extract()
            item['shifttype']=titles.select('span[@class="jobShiftType"]/text()').extract()
            item['location']=titles.select('span[@class="jobLocation"]/text()').extract()
            items.append(item)
        print items
        return items
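Remove the restriction in allowed_domains=['www.texashealth.org'], or change it to allowed_domains=['texashealth.org'] or allowed_domains=['jobs.texashealth.org']. Otherwise the spider will not crawl any of the linked pages: they live on jobs.texashealth.org, which does not match www.texashealth.org, so those requests are filtered out as offsite.

By the way, consider changing the callback's name. The Scrapy docs warn against using parse as a callback in CrawlSpider rules, because CrawlSpider uses the parse method itself to implement its logic; if you override it, the rules stop working. Pick another name, for example parse_items, and set callback="parse_items" in the Rule.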
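To see why the original allowed_domains setting blocks the crawl, here is a minimal standard-library sketch of the host-matching idea. It is only an approximation of Scrapy's offsite filtering, not Scrapy's actual code: the helper name is_allowed and the paginated URL are assumptions made up for illustration.

```python
from urllib.parse import urlparse


def is_allowed(url, allowed_domains):
    """Return True if the URL's host equals an allowed domain or is a subdomain of it.

    Hypothetical helper: a simplified stand-in for the offsite check,
    written only to illustrate the matching rule.
    """
    host = (urlparse(url).hostname or "").lower()
    for domain in allowed_domains:
        domain = domain.lower()
        if host == domain or host.endswith("." + domain):
            return True
    return False


# Assumed example of a pagination link that the rule's "startrow=\d" pattern would match.
NEXT_PAGE = "http://jobs.texashealth.org/search/?startrow=25"

for domains in (["www.texashealth.org"], ["texashealth.org"], ["jobs.texashealth.org"]):
    verdict = "allowed" if is_allowed(NEXT_PAGE, domains) else "filtered as offsite"
    print(domains, "->", verdict)
```

Under this rule, jobs.texashealth.org is neither equal to www.texashealth.org nor a subdomain of it, so only the second and third settings let the pagination links through.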