Scraping an infinite-scroll page with a "Load more" button using Scrapy

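from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class JobsetSpider(CrawlSpider):
    name = 'test'
    allowed_domains = ['jobs.et']
    start_urls = ['https://jobs.et/jobs/']

    rules = (
        # job detail pages are handed to parse_link
        Rule(LinkExtractor(allow=r'https://jobs.et/job/\d+/'), callback='parse_link'),
        # every other link is only followed
        Rule(LinkExtractor(), follow=True),
    )

    def parse_link(self, response):
        yield {
            'url': response.url
        }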

2 Answers

User

Answer 1 · edited 2024-05-02 07:52:45

Ignore the "Load more" button.

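As you mentioned, you can reach every page of jobs through a URL. When parsing the first page of results, read the total number of jobs from the header element: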

<h1 class="search-results__title ">
268 jobs found
</h1>

The site shows 20 jobs per page, so you need to scrape 268/20 = 13.4, rounded up to 14, pages.
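The page-count arithmetic can be sketched in plain Python. This is a minimal sketch, assuming the heading text always has the form "N jobs found" with no thousands separator; the helper name `page_count` and its `per_page` parameter are illustrative and come from neither the site nor Scrapy.

```python
import math
import re


def page_count(title_text, per_page=20):
    """Return how many result pages are needed for the total in the heading."""
    # Assumption: the heading reads like "268 jobs found", possibly padded
    # with whitespace or newlines.
    match = re.search(r"(\d+)\s+jobs?\s+found", title_text)
    if match is None:
        return 0  # no total found, so there is nothing to paginate
    total = int(match.group(1))
    # Round up: a partially filled last page still has to be fetched.
    return math.ceil(total / per_page)


print(page_count("\n268 jobs found\n"))
```

Rounding up rather than to the nearest integer matters here: 13.4 pages means a fourteenth page holding the last 8 jobs.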

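After parsing the first page, create a generator that yields the URLs of the subsequent pages (looping up to page 14) and parse those results with another function. You will need the searchId, which you cannot get from the URL, but it is in a hidden field on the page.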

<input type="hidden" name="searchId" value="1509738711.5142">
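Reading that hidden field can be sketched with the standard library alone. The class and function names below (`HiddenFieldParser`, `find_search_id`) are hypothetical helpers written for this illustration. Inside a Scrapy callback, a selector such as `response.xpath('//input[@name="searchId"]/@value').get()` would probably be the more natural route.

```python
from html.parser import HTMLParser


class HiddenFieldParser(HTMLParser):
    """Remember the value of the first <input> whose name is `field_name`."""

    def __init__(self, field_name):
        super().__init__()
        self.field_name = field_name
        self.value = None

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        is_target = tag == "input" and attributes.get("name") == self.field_name
        if is_target and self.value is None:
            self.value = attributes.get("value")


def find_search_id(html):
    """Return the searchId from the page HTML, or None if the field is absent."""
    parser = HiddenFieldParser("searchId")
    parser.feed(html)
    return parser.value


sample = '<input type="hidden" name="searchId" value="1509738711.5142">'
print(find_search_id(sample))
```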

With this and the page number you can build your URLs:

https://jobs.et/jobs/?searchId=<id>&action=search&page=<page>
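Filling in that template for every remaining page might look like the following. It is a sketch under the assumption that page 1 is the start URL already parsed, so numbering starts at 2; `page_urls` and `BASE_URL` are names made up for this example.

```python
from urllib.parse import urlencode

BASE_URL = "https://jobs.et/jobs/"


def page_urls(search_id, pages):
    """Yield the URLs for pages 2..pages inclusive."""
    # range() excludes its upper bound, hence pages + 1.
    for page in range(2, pages + 1):
        query = urlencode({"searchId": search_id, "action": "search", "page": page})
        yield f"{BASE_URL}?{query}"


urls = list(page_urls("1509738711.5142", 14))
print(len(urls))
print(urls[0])
```

With 14 pages in total this produces 13 URLs, one for each page after the first.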

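Yes, the function that parses the later pages does exactly the same work as the first-page parser, but while you are getting everything working it is fine to live with the duplicated code to keep things straight in your head.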

The code could look something like this:

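import math

from scrapy import Request
from scrapy.spiders import CrawlSpider


class JobsetSpider(CrawlSpider):
    ...
    start_urls = ['https://jobs.et/jobs/']
    ...

    # note: overriding parse() bypasses CrawlSpider's rules,
    # which this approach does not rely on
    def parse(self, response):
        # parse the first page of jobs
        ...
        job_count = xpath(...)  # total from the <h1>, converted to int
        search_id = xpath(...)  # value of the hidden searchId field
        pages = int(math.ceil(job_count / 20.0))
        # page 1 is this response, so request pages 2..pages inclusive
        for page in range(2, pages + 1):
            url = 'https://jobs.et/jobs/?searchId={}&action=search&page={}'.format(search_id, page)
            yield Request(url, callback=self.parse_next_page)

    def parse_next_page(self, response):
        # parse the second and subsequent pages of jobs
        ...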

User

Answer 2 · edited 2024-05-02 07:52:45

You can add something like this:

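# the "Load more" element is taken as the sign that another page exists
has_next = response.css('.load-more').extract()
if has_next:
    # the page number travels in meta; the first response counts as page 1
    next_page = response.meta.get('next_page', 1) + 1
    # the relative search URL ending in 'page=' is read from a <script> block
    url = response.urljoin(response.css('script').re_first(r"'(\?searchId.*page=)'") + str(next_page))
    yield Request(url, meta={'next_page': next_page})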
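To see how the regular expression and the URL join in the snippet above fit together, here is a standalone sketch using only the standard library. The contents of `sample_script` are hypothetical: the real page's script is not shown in this thread, so the variable name and layout are assumptions. The function `next_page_url` is likewise invented for illustration, and it returns None when the pattern is missing instead of raising an error.

```python
import re
from urllib.parse import urljoin

# Same pattern as in the snippet above: a single-quoted query string that
# starts with "?searchId" and ends with "page=".
QUERY_PREFIX = re.compile(r"'(\?searchId.*page=)'")


def next_page_url(page_url, script_text, next_page):
    """Build the absolute URL of `next_page`, or None if no prefix is found."""
    match = QUERY_PREFIX.search(script_text)
    if match is None:
        return None
    # A query-only relative reference keeps the path of the base URL.
    return urljoin(page_url, match.group(1) + str(next_page))


# Hypothetical script content, made up for this example.
sample_script = "var loadMoreUrl = '?searchId=1509738711.5142&action=search&page=';"
print(next_page_url("https://jobs.et/jobs/", sample_script, 2))
```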
