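How do I download the contents of an HTML table?

2 answers

User

#1 · edited 2024-09-30 00:39:28

The page contains 8 tables inside elements with the class table-wrap. The first 4 belong to the "morregnskap" tab and the last 4 belong to the "konsernregnskap" tab, so selecting the last 4 gives you the tables you need, and you can start scraping the data from there.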

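import requests
import bs4

url = 'https://www.proff.no/regnskap/yara-international-asa/oslo/hovedkontortjenester/IGB6AV410NZ/'

# Fetch the page and parse it with the lxml parser
response = requests.get(url)
soup = bs4.BeautifulSoup(response.text, 'lxml')

# Each table on the page sits inside a <div class="table-wrap">
tables = soup.find_all('div', class_='table-wrap')

# The last 4 of the 8 wrappers hold the "konsernregnskap" tables
konsernregnskap_data = tables[4:]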
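
Once the wrapper divs are selected, the rows still have to be pulled out of each one. A minimal sketch of that step, run against a small inline HTML sample instead of the live page. The helper name `rows_from_wrapper`, the sample markup, and the sample values are all made up for illustration, and the real page's markup may differ:

```python
import bs4

# Hypothetical stand-in for the real page: two "table-wrap" divs with
# invented placeholder values, one table each.
SAMPLE_HTML = """
<div class="table-wrap"><table>
  <tr><th>Year</th><th>Revenue</th></tr>
  <tr><td>2018</td><td>100</td></tr>
</table></div>
<div class="table-wrap"><table>
  <tr><th>Year</th><th>Assets</th></tr>
  <tr><td>2018</td><td>250</td></tr>
</table></div>
"""


def rows_from_wrapper(wrapper):
    """Return the table inside one wrapper div as a list of rows of cell text."""
    rows = []
    for tr in wrapper.find_all("tr"):
        cells = [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
        if cells:  # skip rows that contain no cells
            rows.append(cells)
    return rows


# html.parser ships with Python, so the sketch does not depend on lxml.
soup = bs4.BeautifulSoup(SAMPLE_HTML, "html.parser")
wrappers = soup.find_all("div", class_="table-wrap")
extracted = [rows_from_wrapper(w) for w in wrappers]
print(extracted[0])
```

On the real page the same helper could be applied to each element of konsernregnskap_data, assuming each wrapper contains ordinary tr, th, and td elements.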

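User

#2 · edited 2024-09-30 00:39:28

The answer given by @rusu_ro1 is correct. However, I think pandas is the right tool for the job.

You can use pandas.read_html to get all the tables on the page, then use pandas.DataFrame.to_excel to write only the last 4 tables to an Excel workbook.

The script below scrapes the data and writes each table to a separate worksheet.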

import pandas as pd

# read_html returns one DataFrame per <table> found on the page
all_tables = pd.read_html(
    "https://www.proff.no/regnskap/yara-international-asa/oslo/hovedkontortjenester/IGB6AV410NZ/"
)
with pd.ExcelWriter('output.xlsx') as writer:
    # The last 4 of the 8 tables hold the 'konsernregnskap' data
    for idx, df in enumerate(all_tables[4:8]):
        # Drop the last column, which is empty
        df = df.drop(df.columns[-1], axis=1)
        # Pass sheet_name by keyword; newer pandas versions expect this
        df.to_excel(writer, sheet_name="Table {}".format(idx))

Notes:

You can also write all the DataFrames to a single sheet.
Make sure the lxml library is installed: pip install lxml
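
The single-sheet option from the first note can be sketched roughly as follows, using the startrow parameter of DataFrame.to_excel to stack the tables one below the other. The helper names, the two-row gap, the sheet name, and the placeholder frames are assumptions for illustration, and the row arithmetic assumes single-level column headers:

```python
import pandas as pd


def stacked_start_rows(frames, gap=2):
    """Return the first sheet row for each frame when stacked top to bottom."""
    start_rows = []
    next_row = 0
    for df in frames:
        start_rows.append(next_row)
        # One header row plus the data rows, followed by `gap` blank rows.
        next_row += 1 + len(df) + gap
    return start_rows


def write_to_single_sheet(frames, path, sheet_name="konsernregnskap"):
    """Write every frame to the same worksheet, one below the other."""
    with pd.ExcelWriter(path) as writer:
        for df, row in zip(frames, stacked_start_rows(frames)):
            df.to_excel(writer, sheet_name=sheet_name, startrow=row)


# Placeholder frames standing in for the scraped tables.
frames = [
    pd.DataFrame({"a": [1, 2, 3]}),
    pd.DataFrame({"b": [4, 5]}),
]
print(stacked_start_rows(frames))
```

With the scraped data, a call such as write_to_single_sheet(all_tables[4:8], 'output.xlsx') would put the four tables on one sheet, provided an Excel engine such as openpyxl is installed.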

From the pandas.read_html documentation:

flavor : str or None, container of strings
The parsing engine to use. 'bs4' and 'html5lib' are synonymous with each other, they are both there for backwards compatibility. The default of None tries to use lxml to parse and if that fails it falls back on bs4 + html5lib.

From HTML Table Parsing Gotchas:

html5lib generates valid HTML5 markup from invalid markup automatically. This is extremely important for parsing HTML tables, since it guarantees a valid document. However, that does NOT mean that it is “correct”, since the process of fixing markup does not have a single definition.

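In your particular case, the html5lib fallback drops the 5th table, so only 7 tables are returned. Perhaps that is because the 1st and 5th tables contain the same data.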
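
To see whether the parsers disagree on a given page, the same HTML can be parsed once per flavor and the table counts compared. This is only a sketch: the helper name `count_tables_by_flavor` and the sample markup are assumptions, and a flavor is skipped when its parser is not installed:

```python
from io import StringIO

import pandas as pd


def count_tables_by_flavor(html, flavors=("lxml", "bs4")):
    """Parse the same HTML with each flavor and report how many tables come back."""
    counts = {}
    for flavor in flavors:
        try:
            # StringIO avoids passing literal HTML, which newer pandas deprecates.
            tables = pd.read_html(StringIO(html), flavor=flavor)
        except ImportError:
            # The parser behind this flavor is not installed; skip it.
            continue
        counts[flavor] = len(tables)
    return counts


# Small well-formed sample; on markup like this the flavors should agree.
sample = "<table><tr><th>x</th></tr><tr><td>1</td></tr></table>"
print(count_tables_by_flavor(sample))
```

For the page in question, response.text from a requests.get call could be passed in place of sample; differing counts between the flavors would point to the parser as the cause of a missing table.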
