Web scraping with BeautifulSoup fails to extract table rows

Posted 2024-05-04 10:21:52


I am trying to use BeautifulSoup to extract the table on the following web page:

https://www.indiapost.gov.in/VAS/Pages/PMODashboard/DistributionOfPostOffices.aspx

The code I am using is:


import pandas as pd
from urllib.request import urlopen
from bs4 import BeautifulSoup

url = "https://www.indiapost.gov.in/VAS/Pages/PMODashboard/DistributionOfPostOffices.aspx"
html = urlopen(url)

soup = BeautifulSoup(html, 'lxml')
type(soup)

table = soup.find('table', {'class' : 'tbl'})

#extract rows:

rows = soup.find_all('tr')

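The last line should return the table rows along with their HTML tags (with row names such as Sl No, Head Post Office, etc.), but it only gives an empty list. Where am I going wrong?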


1 answer
User
#1 · Posted 2024-05-04 10:21:52

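You may want to take the following approach and use requests to fetch the table content for that page. It turns out that the content you are looking for is available in a separate XML file, https://www.indiapost.gov.in/Documents/DashboardXmlFile/DashboardXML.xml, which you can find using Chrome dev tools. The rows do not appear to be part of the HTML that urlopen receives, which would explain why find_all('tr') returns an empty list.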
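One way to check whether the rows are present in the raw page source at all is to count the `<tr>` tags before reaching for a parser. The following is a minimal sketch, not the answerer's method: `SAMPLE_HTML` is a hypothetical stand-in for the page source (the real markup will differ), and `TagCounter` and `count_tags` are illustrative names. It uses the standard library's `html.parser` instead of BeautifulSoup so that it runs without extra packages.

```python
from html.parser import HTMLParser

# Hypothetical stand-in for the page source: the table shell is present,
# but its rows are filled in later by a script, so no <tr> exists yet.
SAMPLE_HTML = """
<html><body>
  <table class="tbl" id="post-offices"></table>
  <script src="load-table.js"></script>
</body></html>
"""


class TagCounter(HTMLParser):
    """Counts how often each start tag appears in the parsed markup."""

    def __init__(self):
        super().__init__()
        self.counts = {}

    def handle_starttag(self, tag, attrs):
        self.counts[tag] = self.counts.get(tag, 0) + 1


def count_tags(html):
    """Return a dict mapping tag name to number of occurrences."""
    counter = TagCounter()
    counter.feed(html)
    counter.close()
    return counter.counts


counts = count_tags(SAMPLE_HTML)
# A table with zero rows suggests the data is loaded from somewhere else.
print("tables:", counts.get("table", 0), "rows:", counts.get("tr", 0))
```

If the same check on the real response also reports zero rows, the Network tab of the browser's dev tools is the place to look for the request that actually carries the data.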

Working code:

import csv
import requests
from bs4 import BeautifulSoup

url = 'https://www.indiapost.gov.in/Documents/DashboardXmlFile/DashboardXML.xml'

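def get_tabular_info(link):
    # Fetch the XML file that holds the table data.
    r = requests.get(link)
    # The 'xml' parser requires the lxml package to be installed.
    soup = BeautifulSoup(r.text,'xml')
    tabular_list = []
    # Skip the first two Table1 entries, then read columns A-F of each record.
    for items in soup.select("DistributionOfPostOffices Table1")[2:]:
        tabular_list.append([item.get_text(strip=True) for item in items.select("A,B,C,D,E,F")])
    return tabular_list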

if __name__ == '__main__':
    with open("output_indiapost.csv","w",newline="") as f:
        writer = csv.writer(f)

        for item in get_tabular_info(url):
            writer.writerow(item)
            print(item)

The output looks like this:

['Sl. No.', 'Circle Name', 'Head Post Office', 'Sub Post Office', 'Branch Post Office', 'Letter Box']
['1', 'Andhra Pradesh Circle', '59', '1535', '8897', '29510']
['2', 'Assam Circle', '19', '606', '3385', '12427']
['3', 'Bihar Circle', '32', '1029', '8031', '22433']
['4', 'Chhattisgarh Circle', '11', '341', '3079', '14988']
['5', 'Delhi Circle', '12', '406', '142', '1187']
['6', 'Gujarat Circle', '33', '1243', '7651', '24377']
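For reference, the same extraction can be sketched with only the standard library, which also makes it easy to try offline. This is an alternative to the BeautifulSoup code above, not a replacement for it. `SAMPLE_XML` is a hypothetical sample shaped after the selectors used in the answer (`DistributionOfPostOffices`, `Table1`, and columns `A` to `F`), filled with values from the output shown above; the real file's layout is an assumption here and may differ. The helper names `rows_from_xml` and `rows_to_csv` are illustrative.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical sample shaped like the selectors in the answer
# (DistributionOfPostOffices > Table1 > A..F); the real file may differ.
SAMPLE_XML = """<Dashboard>
  <DistributionOfPostOffices>
    <Table1><A>Sl. No.</A><B>Circle Name</B><C>Head Post Office</C><D>Sub Post Office</D><E>Branch Post Office</E><F>Letter Box</F></Table1>
    <Table1><A>1</A><B>Andhra Pradesh Circle</B><C>59</C><D>1535</D><E>8897</E><F>29510</F></Table1>
    <Table1><A>2</A><B>Assam Circle</B><C>19</C><D>606</D><E>3385</E><F>12427</F></Table1>
  </DistributionOfPostOffices>
</Dashboard>"""

COLUMNS = ("A", "B", "C", "D", "E", "F")


def rows_from_xml(xml_text):
    """Return one list of cell strings per Table1 record."""
    root = ET.fromstring(xml_text)
    rows = []
    for record in root.findall(".//DistributionOfPostOffices/Table1"):
        # findtext returns None for a missing column, so fall back to "".
        rows.append([(record.findtext(col) or "").strip() for col in COLUMNS])
    return rows


def rows_to_csv(rows):
    """Serialize the rows to CSV text held in memory."""
    buffer = io.StringIO(newline="")
    csv.writer(buffer).writerows(rows)
    return buffer.getvalue()


rows = rows_from_xml(SAMPLE_XML)
for row in rows:
    print(row)
```

The answer's code slices off the first two `Table1` entries of the real file with `[2:]`; the sample here has no such leading entries, so no slice is applied. To run this against the live file, the text of the HTTP response would be passed to `rows_from_xml` in place of `SAMPLE_XML`.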
