How do I scrape a table from any website and store it as a DataFrame?

Posted 2024-09-26 17:46:25


I need to scrape a table from https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M and store the data in a Python DataFrame. I have pulled the table, but I can't select the columns (Postcode, Borough, Neighbourhood).

My table looks like this:

<table class="wikitable sortable">
<tbody><tr>
<th>Postcode</th>
<th>Borough</th>
<th>Neighbourhood
</th></tr>
<tr>
<td>M1A</td>
<td>Not assigned</td>
<td>Not assigned
</td></tr>
<tr>
<td>M2A</td>
<td>Not assigned</td>
<td>Not assigned
</td></tr>
<tr>
<td>M3A</td>
<td><a href="/wiki/North_York" title="North York">North York</a></td>
<td><a href="/wiki/Parkwoods" title="Parkwoods">Parkwoods</a>
</td></tr>
<tr>
<td>M4A</td>
<td><a href="/wiki/North_York" title="North York">North York</a></td>
<td><a href="/wiki/Victoria_Village" title="Victoria Village">Victoria Village</a>
</td></tr>
...

url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
response = requests.get(url)
soup= BeautifulSoup(response.text, "html.parser")

table = soup.find('table', {'class': 'wikitable sortable'})

df = []

for row in table.find_all('tr'):
    columns = row.find_all('td')
    Postcode = row.columns[1].get_text()
    Borough = row.columns[2].get_text()
    Neighbourhood = row.column[3].get_text()
    df.append([Postcode,Borough,Neighbourhood])

With the code above I get TypeError: 'NoneType' object is not subscriptable.

I googled and found out that I can't do Postcode = row.columns[1].get_text() because of the inline property of the function.

I also tried other approaches, but got some index error messages.

It should be simple: I need to iterate over the rows, pick the three columns from each row, and store them in a list. But I can't put it into code.

The expected output is:

Postcode   Borough        Neighbourhood
M1A        Not assigned   Not assigned
M2A        Not assigned   Not assigned
M3A        North York     Parkwoods

3 Answers

I don't know pandas, but I use this script to scrape tables. Hope it helps.

import requests
from bs4 import BeautifulSoup

url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")

tbl = soup.find('table', {'class': 'wikitable sortable'})
table_dict = {
    # header labels come from the <th> cells
    "head": [th.text.strip() for th in tbl.find_all('th')],
    # data rows are the <tr> elements that contain no <th>, i.e. not the header row
    "rows": [
        [td.text.strip() for td in tr.find_all("td")]
        for tr in tbl.find_all("tr")
        if not tr.find("th")
    ]
}
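
If you do need a DataFrame rather than a dict, a minimal sketch (assuming pandas is installed, and that every data row has the same number of cells as there are header labels, as in the snippet above) can wrap the table_dict built by this script:

import pandas as pd

# table_dict["head"] holds the three header labels, table_dict["rows"] the data rows
df = pd.DataFrame(table_dict["rows"], columns=table_dict["head"])
print(df.head())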

If you want to scrape a table from the web, you can use the pandas library.

import pandas as pd

url = 'valid_url'
# read_html returns a list of all tables found on the page
df = pd.read_html(url)
print(df[0].head())
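
For this particular page you can also point read_html at the Wikipedia URL directly and narrow it down to the wikitable. A sketch, assuming the table's class attribute is exactly "wikitable sortable" as in the snippet in the question, and that lxml or html5lib is installed (read_html needs one of them):

import pandas as pd

url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
# attrs restricts parsing to tables whose attributes match; read_html still
# returns a list of DataFrames, so take the first match
tables = pd.read_html(url, attrs={'class': 'wikitable sortable'})
df = tables[0]
print(df.head())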

Your scraping code is wrong in the part below.

import requests
from bs4 import BeautifulSoup

url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")

table = soup.find('table', {'class': 'wikitable sortable'})

df = []

for row in table.find_all('tr'):
    columns = row.find_all('td')  # the header row contains <th> tags, so querying <td> returns an empty list for it
    if len(columns) > 0:  # skip the header row (and, in general, any empty rows)
        # use the correct zero-based indices to get the three values
        Postcode = columns[0].get_text()
        Borough = columns[1].get_text()
        Neighbourhood = columns[2].get_text()
        df.append([Postcode, Borough, Neighbourhood])

Again, note that get_text also returns the text of links and anchor tags in full; you may want to modify the code to handle that. Happy scraping :)
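
To end up with the DataFrame the question asks for, a minimal sketch (assuming pandas is installed; result is just an illustrative name) can wrap the list collected above:

import pandas as pd

# df above is still a plain list of [Postcode, Borough, Neighbourhood] rows
result = pd.DataFrame(df, columns=['Postcode', 'Borough', 'Neighbourhood'])
print(result.head())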
