How to scrape the next pages in Python using BeautifulSoup

Posted 2024-05-19 07:22:08

Suppose I am scraping this URL:

http://www.engineering.careers360.com/colleges/list-of-engineering-colleges-in-India?sort_filter=alpha

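The listing is spread over a number of pages, all of which contain data I want to scrape. How can I scrape the data from all the following pages as well? I am using Python 3.5.1 and BeautifulSoup. Note: I can't use Scrapy or lxml, because they give me installation errors.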


1 Answer
Answer #1 · posted 2024-05-19 07:22:08

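Determine the last page by extracting the page parameter from the "go to last page" element. Then loop over every page, maintaining a web-scraping session via requests.Session():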

import re

import requests
from bs4 import BeautifulSoup


with requests.Session() as session:
    # extract the last page number from the "go to last page" link
    response = session.get("http://www.engineering.careers360.com/colleges/list-of-engineering-colleges-in-India?sort_filter=alpha")
    soup = BeautifulSoup(response.content, "html.parser")
    last_page = int(re.search(r"page=(\d+)", soup.select_one("li.pager-last").a["href"]).group(1))

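    # loop over every page; the page number must be formatted as an integer (%d)
    # note: range(last_page) stops before last_page; use last_page + 1 if that index is itself a valid page
    for page in range(last_page):
        response = session.get("http://www.engineering.careers360.com/colleges/list-of-engineering-colleges-in-India?sort_filter=alpha&page=%d" % page)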
        soup = BeautifulSoup(response.content, "html.parser")

        # print the title of every search result
        for result in soup.select("li.search-result"):
            title = result.find("div", class_="title").get_text(strip=True)
            print(title)

This prints the title of every search result, one per line.
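As a rough sketch of the same pagination idea without a regular expression, the page parameter can be read and rewritten with the standard library's urllib.parse. The helper names page_from_href and build_page_urls, and the sample href, are made up for illustration and are not from the answer. The sketch assumes page indices start at 0 and that the value found in the last-page link is itself a valid page, so it iterates up to and including that value; adjust this if the site numbers its pages differently.

```python
from urllib.parse import parse_qs, urlencode, urlsplit, urlunsplit


def page_from_href(href):
    """Return the integer value of the `page` query parameter in href."""
    query = parse_qs(urlsplit(href).query)
    return int(query["page"][0])


def build_page_urls(base_url, last_page):
    """Yield one URL per page index from 0 to last_page inclusive (assumed)."""
    parts = urlsplit(base_url)
    params = parse_qs(parts.query)
    for page in range(last_page + 1):
        # overwrite (or add) the page parameter, keeping the other parameters
        params["page"] = [str(page)]
        yield urlunsplit(parts._replace(query=urlencode(params, doseq=True)))


if __name__ == "__main__":
    # hypothetical href, shaped like the one in the "go to last page" link
    sample_href = "/colleges/list-of-engineering-colleges-in-India?sort_filter=alpha&page=3"
    base = "http://www.engineering.careers360.com/colleges/list-of-engineering-colleges-in-India?sort_filter=alpha"
    for url in build_page_urls(base, page_from_href(sample_href)):
        print(url)
```

Each generated URL could then be passed to session.get() inside the loop, in place of the string formatting used in the answer.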
