How do I download the contents of an HTML table?

Posted 2024-09-30 00:39:28


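I want to download the financial data from the following website (the "konsernregnskap" figures, not the "morregnskap" ones), but I am not sure how to download all of it: https://www.proff.no/regnskap/yara-international-asa/oslo/hovedkontortjenester/IGB6AV410NZ/

I tried to locate the tables with XPath, but without success.

I would like to download all of the content into an Excel sheet.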


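2 Answers

The page has 8 tables, each inside an element with the class table-wrap. The first 4 belong to the "morregnskap" tab and the last 4 to the "konsernregnskap" tab, so selecting the last 4 gives you the tables you need, and you can start scraping the data from them.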

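import requests
import bs4

url = 'https://www.proff.no/regnskap/yara-international-asa/oslo/hovedkontortjenester/IGB6AV410NZ/'

# Fetch the page and parse it (the 'lxml' parser requires the lxml package)
response = requests.get(url)
soup = bs4.BeautifulSoup(response.text, 'lxml')

# Each financial table sits inside a <div class="table-wrap">
tables = soup.find_all('div', class_='table-wrap')

# Tables 0-3 are "morregnskap"; tables 4-7 are "konsernregnskap"
konsernregnskap_data = tables[4:]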
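Once the wrapper elements are selected, each one still has to be turned into rows of cell text. Below is a minimal sketch of that step. It runs against a small inline HTML sample rather than the live page, and the sample figures, the `wrapper_to_rows` helper, and the assumption that each wrapper holds plain `tr`/`th`/`td` markup are all illustrative and not taken from the real site.

```python
import bs4

# Made-up stand-in for the real page: two wrappers, one table each.
SAMPLE_HTML = """
<div class="table-wrap"><table>
  <tr><th>Post</th><th>2018</th></tr>
  <tr><td>Sum driftsinntekter</td><td>100</td></tr>
</table></div>
<div class="table-wrap"><table>
  <tr><th>Post</th><th>2018</th></tr>
  <tr><td>Sum eiendeler</td><td>250</td></tr>
</table></div>
"""


def wrapper_to_rows(wrapper):
    """Return the table inside `wrapper` as a list of rows of cell text."""
    rows = []
    for tr in wrapper.find_all("tr"):
        # Header and data cells are treated alike here.
        cells = [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
        if cells:
            rows.append(cells)
    return rows


# html.parser ships with Python, so this sketch needs no extra parser package.
soup = bs4.BeautifulSoup(SAMPLE_HTML, "html.parser")
wrappers = soup.find_all("div", class_="table-wrap")
extracted = [wrapper_to_rows(w) for w in wrappers]

for table in extracted:
    print(table)
```

On the real page the same helper could be applied to each element in `konsernregnskap_data`, provided the markup matches the assumption above.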

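The answer given by @rusu_ro1 is correct. However, I think pandas is the right tool for the job.

You can use pandas.read_html to get all the tables on the page, then use pandas.DataFrame.to_excel to write only the last 4 tables to an Excel workbook.

The script below scrapes the data and writes each table to a separate worksheet.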

import pandas as pd

# read_html returns one DataFrame per <table> found on the page
all_tables = pd.read_html(
    "https://www.proff.no/regnskap/yara-international-asa/oslo/hovedkontortjenester/IGB6AV410NZ/"
)
with pd.ExcelWriter('output.xlsx') as writer:
    # The last 4 tables hold the 'konsernregnskap' data
    for idx, df in enumerate(all_tables[4:8]):
        # Remove the last column (empty)
        df = df.drop(df.columns[-1], axis=1)
        # Write each table to its own worksheet
        df.to_excel(writer, sheet_name="Table {}".format(idx))

Notes:

flavor : str or None, container of strings

The parsing engine to use. ‘bs4’ and ‘html5lib’ are synonymous with each other, they are both there for backwards compatibility. The default of None tries to use lxml to parse and if that fails it falls back on bs4 + html5lib.

HTML Table Parsing Gotchas

html5lib generates valid HTML5 markup from invalid markup automatically. This is extremely important for parsing HTML tables, since it guarantees a valid document. However, that does NOT mean that it is “correct”, since the process of fixing markup does not have a single definition.

In your particular case, html5lib drops the 5th table (it returns only 7 tables), perhaps because the 1st and 5th tables contain the same data.
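Because the parser can change how many tables come back, it may be worth counting the tables each flavor returns before slicing by position. The sketch below does this on a small inline sample, not the live page. The `count_tables` helper and the sample HTML are hypothetical, and a flavor whose parser package is not installed is reported as None.

```python
from io import StringIO

import pandas as pd

# Made-up sample with two well-formed tables.
SAMPLE_HTML = """
<table>
  <tr><th>Post</th><th>2018</th></tr>
  <tr><td>A</td><td>1</td></tr>
</table>
<table>
  <tr><th>Post</th><th>2018</th></tr>
  <tr><td>B</td><td>2</td></tr>
</table>
"""


def count_tables(html, flavors=("lxml", "bs4")):
    """Map each flavor to the number of tables read_html finds with it."""
    counts = {}
    for flavor in flavors:
        try:
            counts[flavor] = len(pd.read_html(StringIO(html), flavor=flavor))
        except ImportError:
            # The parser backing this flavor is not installed.
            counts[flavor] = None
    return counts


counts = count_tables(SAMPLE_HTML)
print(counts)
```

If the counts differ on the real page, positional slices such as `all_tables[4:8]` would need adjusting for the flavor in use.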
