How to download the data from all pages of a web page with Python when every page has the same URL

Posted 2024-10-01 07:30:58


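I am trying to download all entries of the table shown on this web page: https://udhonline.rajasthan.gov.in/Portal/AuctionList. There are buttons that load the next entries of the table, but the URL of the page stays the same. I want to download all of the data with Python, and I tried the following: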

pd.read_html(link)

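This returns a list containing the first 30 results of the table, plus another item that contains all 30 results. By default the page shows only the first 30 results. How can I get the data from all of the following pages?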


2 Answers

You can use this example to load the data from multiple pages into a DataFrame:

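from io import StringIO

import pandas as pd
import requests
from bs4 import BeautifulSoup

api_url = "https://udhonline.rajasthan.gov.in/Portal/SearchAuctionGrid"

# Query-string parameters sent when a pager button is clicked
params = {
    "page": "1",
    "Paging": "True",
    "pageSize": "30",
    "TabViewType": "0",
    "UnitId": "0",
}

dfs = []
for page in range(1, 4):  # <-- increase the number of pages here
    params["page"] = page
    response = requests.post(api_url, params=params)
    response.raise_for_status()  # stop early on an HTTP error
    soup = BeautifulSoup(response.content, "html.parser")
    # Select only the innermost tables (tables that contain no nested table)
    for t in soup.select("table:not(:has(table))"):
        # StringIO avoids the pandas deprecation warning for literal HTML strings
        dfs.append(pd.read_html(StringIO(str(t)))[0].T)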

df = pd.concat(dfs).reset_index(drop=True)
print(df)
df.to_csv("data.csv", index=False)

Prints:

                                                     0                                                  1                                    2                                             3                                           4                       5                                                  6
0         AROGYA NAGAR RESIDENTIAL PLOT NO. 220 [9263]                          Scheme Name: Arogya Nagar                 Property Number: 220  EMD Deposit Start Date: 01-Jun-2021 08:00 AM  EMD Deposit End Date: 06-Jun-2021 11:59 PM    EMD Deposit Ends In:  Assessed Property Value as per Bid Start Price...
1         AROGYA NAGAR RESIDENTIAL PLOT NO. 220 [9263]                          Scheme Name: Arogya Nagar                 Property Number: 220  EMD Deposit Start Date: 01-Jun-2021 08:00 AM  EMD Deposit End Date: 06-Jun-2021 11:59 PM    EMD Deposit Ends In:  Assessed Property Value as per Bid Start Price...
2         AROGYA NAGAR RESIDENTIAL PLOT NO. 220 [9263]                          Scheme Name: Arogya Nagar   Property Area: 2118.32 Square Feet          Bid Start Date: 03-Jun-2021 10:00 AM          Bid End Date: 07-Jun-2021 11:00 AM            Bid Ends In:  Assessed Property Value as per Bid Start Price...
3         AROGYA NAGAR RESIDENTIAL PLOT NO. 220 [9263]                          Scheme Name: Arogya Nagar   Property Area: 2118.32 Square Feet          Bid Start Date: 03-Jun-2021 10:00 AM          Bid End Date: 07-Jun-2021 11:00 AM            Bid Ends In:  Assessed Property Value as per Bid Start Price...
4         AROGYA NAGAR RESIDENTIAL PLOT NO. 220 [9263]                          Scheme Name: Arogya Nagar              Usage Type: Residential                   EMD Amount (Rs.): 211900.00                    View Details Participate                     NaN  Assessed Property Value as per Bid Start Price...
5         AROGYA NAGAR RESIDENTIAL PLOT NO. 220 [9263]                          Scheme Name: Arogya Nagar              Usage Type: Residential                   EMD Amount (Rs.): 211900.00                    View Details Participate                     NaN  Assessed Property Value as per Bid Start Price...
6          CHANAKYAPURI RESIDENTIAL PLOT NO. 14 [9262]                          Scheme Name: Chanakyapuri                  Property Number: 14  EMD Deposit Start Date: 01-Jun-2021 08:00 AM  EMD Deposit End Date: 06-Jun-2021 11:59 PM    EMD Deposit Ends In:  Assessed Property Value as per Bid Start Price...
7          CHANAKYAPURI RESIDENTIAL PLOT NO. 14 [9262]                          Scheme Name: Chanakyapuri                  Property Number: 14  EMD Deposit Start Date: 01-Jun-2021 08:00 AM  EMD Deposit End Date: 06-Jun-2021 11:59 PM    EMD Deposit Ends In:  Assessed Property Value as per Bid Start Price...

...

and saves the result to data.csv.
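The loop above uses a fixed page range. As a rough sketch of how the page count could be discovered instead, the helper below keeps requesting pages until one comes back empty. The names collect_all_pages and fetch_page are hypothetical, and the assumption that the server returns no rows past the last page has not been checked against this site; a dictionary stands in for the network call here.

```python
def collect_all_pages(fetch_page, max_pages=1000):
    """Call fetch_page(1), fetch_page(2), ... until a page yields no rows.

    fetch_page is any callable that takes a page number and returns a list
    of rows. max_pages is a safety cap so the loop always terminates.
    """
    rows = []
    for page in range(1, max_pages + 1):
        page_rows = fetch_page(page)
        if not page_rows:  # assumed signal that we ran past the last page
            break
        rows.extend(page_rows)
    return rows


# Stand-in for the real request: three pages of fake rows, then nothing.
fake_site = {1: ["a", "b"], 2: ["c", "d"], 3: ["e"]}
result = collect_all_pages(lambda page: fake_site.get(page, []))
print(result)
```

In real use, fetch_page would wrap the POST request and table parsing from the answer above. If the server repeats the last page instead of returning an empty one, the stop condition would need to compare each page with the previous one.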

Here is what I found by examining the page:

  1. When you click on page 2, the browser actually sends a POST request to the server; see the curl request below.

curl "https://udhonline.rajasthan.gov.in/Portal/SearchAuctionGrid?page=2&Paging=True&pageSize=30&TabViewType=0&UnitId=0" -H "Content-Type: application/x-www-form-urlencoded; charset=UTF-8" -H "X-Requested-With: XMLHttpRequest" -H "Origin: https://udhonline.rajasthan.gov.in" -H "Connection: keep-alive" -H "Referer: https://udhonline.rajasthan.gov.in/Portal/AuctionList" --data-raw "X-Requested-With=XMLHttpRequest"

So just send a POST request to https://udhonline.rajasthan.gov.in/Portal/SearchAuctionGrid?page=2&Paging=True&pageSize=30&TabViewType=0&UnitId=0

with the header X-Requested-With: XMLHttpRequest and the body X-Requested-With=XMLHttpRequest
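The same request can be put together in Python. This is a minimal sketch using only the standard library; the helper name build_page_request is made up for illustration, and it only builds the request without sending it. Whether the server really needs both the header and the body field was not verified here.

```python
from urllib.parse import urlencode
from urllib.request import Request

API_URL = "https://udhonline.rajasthan.gov.in/Portal/SearchAuctionGrid"


def build_page_request(page, page_size=30):
    """Build (but do not send) the POST request for one page of results."""
    # Same query-string parameters as in the curl command
    query = urlencode({
        "page": page,
        "Paging": "True",
        "pageSize": page_size,
        "TabViewType": 0,
        "UnitId": 0,
    })
    return Request(
        f"{API_URL}?{query}",
        # Form-encoded body, as passed to curl's --data-raw
        data=urlencode({"X-Requested-With": "XMLHttpRequest"}).encode(),
        headers={
            "X-Requested-With": "XMLHttpRequest",
            "Content-Type": "application/x-www-form-urlencoded; charset=UTF-8",
        },
        method="POST",
    )


req = build_page_request(2)
print(req.get_method(), req.full_url)
```

To actually fetch the page, pass the request to urllib.request.urlopen(req) and read the response; the HTML that comes back can then be handed to pd.read_html.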

  2. For the number of items per page

send a POST request to this URL: https://udhonline.rajasthan.gov.in/Portal/SearchAuctionGrid

with this header

X-Requested-With: XMLHttpRequest

and this data

PageSize=50&UnitId=&X-Requested-With=XMLHttpRequest
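A small sketch of how that body could be produced in Python, again with the standard library only. The helper name build_page_size_request is hypothetical, and the request is only constructed, not sent; how large a PageSize the server accepts is not known from this answer.

```python
from urllib.parse import urlencode
from urllib.request import Request

API_URL = "https://udhonline.rajasthan.gov.in/Portal/SearchAuctionGrid"


def build_page_size_request(page_size=50):
    """Build (but do not send) the POST request that sets items per page."""
    # An empty string encodes as "UnitId=", matching the body shown above
    body = urlencode({
        "PageSize": page_size,
        "UnitId": "",
        "X-Requested-With": "XMLHttpRequest",
    })
    return Request(
        API_URL,
        data=body.encode(),
        headers={"X-Requested-With": "XMLHttpRequest"},
        method="POST",
    )


size_req = build_page_size_request(50)
print(size_req.data.decode())
```

Sending it works the same way as any other urllib request: urllib.request.urlopen(size_req).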
