How do I extract the contents of a table from a web page with Python?

Posted 2024-06-25 05:40:43


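I need help extracting the kmz and zip files from a web page. The code below extracts the table, but not the files and links inside it. What can I add to the code so that the output table also includes the links and files, rather than just plain text?

Web page: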

https://www.nhc.noaa.gov/gis/

Code:

import pandas as pd
url = 'https://www.nhc.noaa.gov/gis/'
result = pd.read_html(url)[0]
result

2 Answers

You can use BeautifulSoup to get all the links in the table:

from bs4 import BeautifulSoup
import requests
import pandas as pd
url = 'https://www.nhc.noaa.gov/gis/'

res = requests.get(url)
soup = BeautifulSoup(res.text, "html.parser")
table = soup.find("table")
for anchor in table.find_all("a"):
    print("Text - {}, Link - {}".format(anchor.get_text(strip=True), anchor["href"]))

Output:

Text - Irma Example, Link - /gis/examples/al112017_5day_020.zip
Text - Cone, Link - /gis/examples/AL112017_020adv_CONE.kmz
Text - Track, Link - /gis/examples/AL112017_020adv_TRACK.kmz
Text - Warnings, Link - /gis/examples/AL112017_020adv_WW.kmz
Text - shp, Link - forecast/archive/al092020_5day_latest.zip
Text - Cone, Link - /storm_graphics/api/AL092020_CONE_latest.kmz
Text - Track, Link - /storm_graphics/api/AL092020_TRACK_latest.kmz
Text - Warnings, Link - /storm_graphics/api/AL092020_WW_latest.kmz

If you want to keep the DataFrame, do not make a second network call through read_html. Reuse the response object:

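from io import StringIO
df = pd.read_html(StringIO(res.text))[0]  # read_html returns a list of DataFrames; [0] is the first table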
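
Since the question asks for a table that keeps its links, it may be worth noting that pandas 1.5 and later accept an extract_links argument in read_html. The sketch below is an illustration under assumptions, not part of the original answer: sample_html is a made-up stand-in for res.text, and read_table_with_links is a hypothetical helper name. With extract_links="body", each body cell comes back as a (text, href) tuple, with None as the href when the cell has no link. As far as I know only the first link in a cell is captured, so cells with several links still need the BeautifulSoup loop above.

```python
from io import StringIO

import pandas as pd

# Made-up stand-in for res.text; the real page has more rows and columns.
sample_html = """
<table>
  <tr><th>Product</th><th>Files</th></tr>
  <tr><td>Forecast cone</td>
      <td><a href="/gis/examples/AL112017_020adv_CONE.kmz">Cone</a></td></tr>
  <tr><td>Plain row</td><td>No file</td></tr>
</table>
"""


def read_table_with_links(html):
    """Parse the first table, keeping (text, href) tuples for body cells."""
    # StringIO avoids passing literal HTML, which newer pandas deprecates.
    tables = pd.read_html(StringIO(html), extract_links="body")
    return tables[0]


df = read_table_with_links(sample_html)
print(df)
```

With the live page, the call would be read_table_with_links(res.text).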

To get full links, prepend the following to every link:

https://www.nhc.noaa.gov

Code:

prefix = "https://www.nhc.noaa.gov"  # site root to put in front of each href
for anchor in table.find_all("a"):
    print("Text - {}, Link - {}".format(anchor.get_text(strip=True), prefix + anchor["href"]))

Output:

Text - Irma Example, Link - https://www.nhc.noaa.gov/gis/examples/al112017_5day_020.zip
Text - Cone, Link - https://www.nhc.noaa.gov/gis/examples/AL112017_020adv_CONE.kmz
Text - Track, Link - https://www.nhc.noaa.gov/gis/examples/AL112017_020adv_TRACK.kmz
Text - Warnings, Link - https://www.nhc.noaa.gov/gis/examples/AL112017_020adv_WW.kmz
Text - shp, Link - https://www.nhc.noaa.govforecast/archive/al092020_5day_latest.zip
Text - Cone, Link - https://www.nhc.noaa.gov/storm_graphics/api/AL092020_CONE_latest.kmz
Text - Track, Link - https://www.nhc.noaa.gov/storm_graphics/api/AL092020_TRACK_latest.kmz
Text - Warnings, Link - https://www.nhc.noaa.gov/storm_graphics/api/AL092020_WW_latest.kmz
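
Note that plain concatenation mangles the "shp" link in the output above (https://www.nhc.noaa.govforecast/...), because that href has no leading slash. One way around this, sketched here as a suggestion rather than as part of the original answer, is urljoin from the standard library, which resolves each href against the page URL the way a browser would. The two hrefs are copied from the output above; base_url and full_links are illustrative names.

```python
from urllib.parse import urljoin

# The page the links were scraped from; relative hrefs resolve against it.
base_url = "https://www.nhc.noaa.gov/gis/"

hrefs = [
    "/gis/examples/al112017_5day_020.zip",        # root-relative href
    "forecast/archive/al092020_5day_latest.zip",  # href relative to the page
]

# urljoin handles both forms, so no manual prefix is needed.
full_links = [urljoin(base_url, href) for href in hrefs]
for link in full_links:
    print(link)
```

In the loop above, this would mean replacing prefix + anchor["href"] with urljoin(url, anchor["href"]).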

To download the files, use requests again to fetch each link.
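
The download step is not spelled out in the answer, so here is a minimal sketch of one way to do it. The helper names filename_from_url and download_file, the downloads directory, the 30-second timeout, and the chunk size are all assumptions chosen for illustration. The file is streamed to disk in chunks so that a large zip is not held in memory.

```python
from pathlib import Path
from urllib.parse import urlparse

import requests


def filename_from_url(file_url):
    """Return the last path component of the URL, e.g. 'x.kmz'."""
    return Path(urlparse(file_url).path).name


def download_file(file_url, dest_dir="downloads"):
    """Stream one file into dest_dir and return the local path."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    target = dest / filename_from_url(file_url)
    with requests.get(file_url, stream=True, timeout=30) as response:
        response.raise_for_status()  # stop on 404 and other HTTP errors
        with open(target, "wb") as fh:
            for chunk in response.iter_content(chunk_size=8192):
                fh.write(chunk)
    return target
```

Calling download_file on each full link, for example https://www.nhc.noaa.gov/gis/examples/AL112017_020adv_CONE.kmz, would save it under downloads/ with its original file name.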

I would suggest using BeautifulSoup (bs4) rather than pandas to parse the HTML:

pip install beautifulsoup4 requests

Then it is as simple as this:

import bs4
import requests

result = bs4.BeautifulSoup(requests.get('https://www.nhc.noaa.gov/gis/').content, features='html.parser')
for link in result.find('table').find_all('a'):
    print(link.attrs['href'])
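
Because the question is specifically about kmz and zip files, the loop above could be narrowed to those extensions. The following is a small sketch along those lines; file_links is a hypothetical helper name and sample_html is a made-up fragment standing in for the downloaded page, so treat it as an illustration rather than the answerer's code.

```python
import bs4


def file_links(html, extensions=(".kmz", ".zip")):
    """Return hrefs from the first table that end in one of the extensions."""
    soup = bs4.BeautifulSoup(html, features="html.parser")
    table = soup.find("table")
    if table is None:  # page without a table: nothing to report
        return []
    wanted = tuple(extensions)  # str.endswith needs a tuple, not a list
    return [
        a["href"]
        for a in table.find_all("a", href=True)  # skip anchors with no href
        if a["href"].lower().endswith(wanted)
    ]


# Made-up fragment standing in for the real page content.
sample_html = """
<table>
  <tr><td><a href="/gis/examples/al112017_5day_020.zip">Irma Example</a></td>
      <td><a href="/gis/examples/AL112017_020adv_CONE.kmz">Cone</a></td>
      <td><a href="/aboutgis.shtml">About</a></td></tr>
</table>
"""

links = file_links(sample_html)
print(links)
```

With the live page, the HTML argument would be requests.get('https://www.nhc.noaa.gov/gis/').content.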
