Navigating to the next pages listed on the first crawled page

Posted 2024-09-29 01:28:34


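Hi, I need help with the following code. It should follow the links listed on the start URL and scrape data from the remaining pages as well. Please help.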

class texashealthspider(CrawlSpider):

    name="texashealth2"
    allowed_domains=['www.texashealth.org']
    start_urls=['http://jobs.texashealth.org/search/']

    rules=(
        Rule(SgmlLinkExtractor(allow=("startrow=\d",)),callback="parse",follow=True),
        )

    def parse(self, response):
        hxs=HtmlXPathSelector(response)
        titles=hxs.select('//tbody/tr/td')
        items = []

        # Build one item per table cell; the loop belongs inside parse()
        for title in titles:
            item=TexashealthItem()
            item['title']=title.select('span[@class="jobTitle"]/a/text()').extract()
            item['link']=title.select('span[@class="jobTitle"]/a/@href').extract()
            item['shifttype']=title.select('span[@class="jobShiftType"]/text()').extract()
            item['location']=title.select('span[@class="jobLocation"]/text()').extract()
            items.append(item)
        print items
        return items

1 Answer
User
#1 · Posted 2024-09-29 01:28:34

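Loosen the restriction in allowed_domains=['www.texashealth.org']: change it to allowed_domains=['texashealth.org'] or {}. Otherwise no pages will be crawled, because the links to follow are on jobs.texashealth.org, which is not www.texashealth.org or a subdomain of it, so they are filtered out as offsite.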
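To see why the original setting blocks the crawl, here is a rough approximation of a domain check of this kind. It is a sketch only, not Scrapy's actual code; `is_allowed` is a hypothetical helper, and the rule assumed is "the host equals an allowed domain or is a subdomain of it".

```python
from urllib.parse import urlparse


def is_allowed(url, allowed_domains):
    """Hypothetical helper: True if the URL's host equals an allowed
    domain or is a subdomain of one (an approximation, not Scrapy's code)."""
    host = urlparse(url).hostname or ""
    return any(host == d or host.endswith("." + d) for d in allowed_domains)


start_url = "http://jobs.texashealth.org/search/"

# jobs.texashealth.org is not www.texashealth.org, nor a subdomain of it
print(is_allowed(start_url, ["www.texashealth.org"]))  # False

# jobs.texashealth.org is a subdomain of texashealth.org
print(is_allowed(start_url, ["texashealth.org"]))      # True
```

Under that assumption, listing the bare registered domain covers both the www and jobs hosts.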

By the way, consider renaming the callback function, as the docs advise:

Warning

When writing crawl spider rules, avoid using parse as callback, since the CrawlSpider uses the parse method itself to implement its logic. So if you override the parse method, the crawl spider will no longer work.
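The effect described in the warning can be illustrated with a toy model. This is a deliberately simplified stand-in, not Scrapy's real implementation: `ToyCrawlSpider`, `rule_callback`, and `parse_items` are all hypothetical names chosen for the sketch.

```python
class ToyCrawlSpider:
    """Simplified stand-in for a crawl spider; NOT Scrapy's real code."""

    rule_callback = None  # name of the method that handles matched pages

    def parse(self, response):
        # The base class uses parse() for its own logic: it "follows links"
        # and then hands the response to the callback named by the rule.
        handler = getattr(self, self.rule_callback)
        return ["followed links"] + handler(response)


class BrokenSpider(ToyCrawlSpider):
    rule_callback = "parse"

    def parse(self, response):  # replaces the base logic entirely
        return ["item from " + response]


class FixedSpider(ToyCrawlSpider):
    rule_callback = "parse_items"

    def parse_items(self, response):  # base parse() stays intact
        return ["item from " + response]


print(BrokenSpider().parse("page 1"))  # ['item from page 1']
print(FixedSpider().parse("page 1"))   # ['followed links', 'item from page 1']
```

In the spider from the question, the equivalent change would be to rename the method to something like parse_items and pass callback="parse_items" in the Rule.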
