import pandas as pd

# Read every <table> on the page into a list of DataFrames
all_tables = pd.read_html(
    "https://www.proff.no/regnskap/yara-international-asa/oslo/hovedkontortjenester/IGB6AV410NZ/"
)

with pd.ExcelWriter('output.xlsx') as writer:
    # The last 4 tables hold the 'konsernregnskap' data
    for idx, df in enumerate(all_tables[4:8]):
        # Remove the last column (empty)
        df = df.drop(df.columns[-1], axis=1)
        # Write each table to its own worksheet
        df.to_excel(writer, sheet_name="Table {}".format(idx))
The parsing engine to use. ‘bs4’ and ‘html5lib’ are synonymous with
each other, they are both there for backwards compatibility. The
default of None tries to use lxml to parse and if that fails it falls
back on bs4 + html5lib.
html5lib generates valid HTML5 markup from invalid markup
automatically. This is extremely important for parsing HTML tables,
since it guarantees a valid document. However, that does NOT mean that
it is “correct”, since the process of fixing markup does not have a
single definition.
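There are 8 tables in the class table-wrap. The first 4 belong to the "morregnskap" tab and the last 4 to the "konsernregnskap" tab, so selecting the last 4 tables gives you the ones you need, and you can start scraping data from them.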
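The selection described above can be sketched with the standard library alone. This is only an illustration: the class name TableCollector and the sample markup are made up, and it assumes each table sits inside a div whose class is table-wrap, which may not match the real page exactly.

```python
from html.parser import HTMLParser


class TableCollector(HTMLParser):
    """Collect the cell text of every <table> inside a <div class="table-wrap">."""

    def __init__(self):
        super().__init__()
        self.div_depth = 0    # <div> nesting depth inside a table-wrap div (0 = outside)
        self.tables = []      # one list of cell strings per table found
        self.in_cell = False  # True while inside a <td> or <th>

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            classes = (dict(attrs).get("class") or "").split()
            if self.div_depth or "table-wrap" in classes:
                self.div_depth += 1
        elif tag == "table" and self.div_depth:
            self.tables.append([])
        elif tag in ("td", "th") and self.div_depth and self.tables:
            self.in_cell = True

    def handle_endtag(self, tag):
        if tag == "div" and self.div_depth:
            self.div_depth -= 1
        elif tag in ("td", "th"):
            self.in_cell = False

    def handle_data(self, data):
        if self.in_cell and data.strip():
            self.tables[-1].append(data.strip())


# Hypothetical markup: 8 wrapped tables, standing in for the two tabs
sample = "".join(
    '<div class="table-wrap"><table><tr><td>t{}</td></tr></table></div>'.format(i)
    for i in range(8)
)

collector = TableCollector()
collector.feed(sample)

# Keep only the last 4 tables (the "konsernregnskap" half in the description above)
wanted = collector.tables[-4:]
print(len(collector.tables), wanted)
```

Slicing from the end with [-4:] avoids hard-coding the total number of tables, which helps if a parser returns fewer tables than expected.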
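The answer given by @rusu_ro1 is correct. However, I think Pandas is the right tool for the job.
You can use pandas.read_html to get all the tables on the page, then use pandas.DataFrame.to_excel to write only the last 4 tables to an Excel workbook.
The script above scrapes the data and writes each table to a separate worksheet.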
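To try the slicing and column-dropping steps without a network connection or an Excel engine, here is a small offline sketch. The DataFrames are hypothetical stand-ins for what pandas.read_html would return, and the column names are invented for the example.

```python
import pandas as pd

# Stand-in for the list returned by pd.read_html: 8 small tables,
# each ending in an empty column (made-up data, not from the real page)
all_tables = [
    pd.DataFrame({"item": ["a", "b"], "value": [i, i + 1], "empty": [None, None]})
    for i in range(8)
]

sheets = {}
for idx, df in enumerate(all_tables[4:8]):
    # Drop the trailing empty column by position, as in the script
    sheets["Table {}".format(idx)] = df.drop(df.columns[-1], axis=1)

print(list(sheets))
print(sheets["Table 0"].columns.tolist())
```

Each entry in sheets could then be written with df.to_excel(writer, sheet_name=name) inside a pd.ExcelWriter block, which requires an Excel engine such as openpyxl to be installed.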
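Notes:
pip install lxml
The html5lib passage above is from HTML Table Parsing Gotchas.
In your specific case, it drops the 5th table (only 7 are returned), perhaps because the 1st and 5th tables both contain the same data.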
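One way to check whether the parsers disagree on a given document is to count the tables each flavor returns. The sketch below uses made-up markup rather than the real page, and it skips any parser that is not installed. For well-formed markup like this the counts should agree; a mismatch on a real page points to a parser difference such as the one described above.

```python
from io import StringIO

import pandas as pd

# Hypothetical markup with two simple tables (not taken from the real page)
html = """
<table><tr><th>a</th><th>b</th></tr><tr><td>1</td><td>2</td></tr></table>
<table><tr><th>c</th><th>d</th></tr><tr><td>3</td><td>4</td></tr></table>
"""

counts = {}
for flavor in ("lxml", "html5lib"):
    try:
        # Force one specific parser instead of relying on the default fallback
        counts[flavor] = len(pd.read_html(StringIO(html), flavor=flavor))
    except ImportError:
        # The parser (or bs4, which the html5lib flavor needs) is not installed
        counts[flavor] = None

print(counts)
```

Passing flavor explicitly makes the result reproducible across machines, since the default of None depends on which parsers happen to be installed.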