Scraping data from span tags using BeautifulSoup

import requests
from bs4 import BeautifulSoup

# climate_df is a pandas DataFrame created earlier (not shown in the question)
http_url = "https://en.tutiempo.net/climate/01-2013/ws-432950.html"
retreived_data = requests.get(http_url).text
soup = BeautifulSoup(retreived_data, "lxml")
climate_table = soup.find("table", attrs={"class": "medias mensuales numspan"})
climate_data = climate_table.find_all("tr")
for data in climate_data[1:-2]:
    table_data = data.find_all("td")
    row_data = []
    for row in table_data:
        row_data.append(row.get_text())
    climate_df.loc[len(climate_df)] = row_data

1 answer

User
#1 · Posted 2024-10-02 10:26:57

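I misunderstood your question at first because you referenced two different URLs. I see what you mean now.
Yes, oddly enough, in the second table they fill in the content of some <td> tags using CSS. What you need to do is pull those special cases out of the <style> tag. Once you have them, you can substitute them into the HTML source and finally parse the result into a DataFrame. I use pandas because its read_html function parses <table> tags for you (it can use BeautifulSoup under the hood). I believe this gets you what you want: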
import re
from io import StringIO

import pandas as pd
import requests
from bs4 import BeautifulSoup

http_url = "https://en.tutiempo.net/climate/01-2013/ws-432950.html"
retreived_data = requests.get(http_url).text
soup = BeautifulSoup(retreived_data, "lxml")

# The second <style> tag holds the CSS rules that inject the hidden values
hiddenData = str(soup.find_all('style')[1])

# Map each span class to the text its CSS rule inserts
hiddenSpan = {}
for group in re.findall(r'span\.(.+?)}', hiddenData):
    class_attr = group.split('span.')[-1].split('::')[0]
    content = group.split('"')[1]
    hiddenSpan[class_attr] = content

climate_table = str(soup.find("table", attrs={"class": "medias mensuales numspan"}))

# Replace every empty placeholder span with its CSS-supplied text
for k, v in hiddenSpan.items():
    climate_table = climate_table.replace('<span class="%s"></span>' % k, v)

# Passing a raw HTML string is deprecated in recent pandas, so wrap it in StringIO
df = pd.read_html(StringIO(climate_table))[0]
Output:
print(df.to_string())

    Day     T    TM    Tm  SLP     H  PP   VV    V    VM   VG   RA   SN   TS   FG
 0    1  23.4  30.3    19    -    59   0  6.3  4.3   5.4    -  NaN  NaN  NaN  NaN
 1    2  22.4  30.3  16.9    -    57   0  6.9  3.3   7.6    -  NaN  NaN  NaN  NaN
 2    3    24  31.8  16.9    -    51   0  6.9  2.8   5.4    -  NaN  NaN  NaN  NaN
 3    4  24.2    32  17.4    -    53   0    6  3.3   5.4    -  NaN  NaN  NaN  NaN
 4    5  23.8    32    18    -    58   0  6.9  3.1   7.6    -  NaN  NaN  NaN  NaN
 5    6  23.3    31  18.3    -    60   0  6.9    5   9.4    -  NaN  NaN  NaN  NaN
 6    7  22.8  30.2  17.6    -    55   0  7.7  3.7   7.6    -  NaN  NaN  NaN  NaN
 7    8  23.1  30.6  17.4    -    46   0  6.9  3.3   5.4    -  NaN  NaN  NaN  NaN
 8    9  22.9  30.6  17.4    -    51   0  6.9  3.5   3.5    -  NaN  NaN  NaN  NaN
 9   10  22.3    30    17    -    56   0  6.3  3.3   7.6    -  NaN  NaN  NaN  NaN
10   11  22.3  29.4    17    -    53   0  6.9  4.3   7.6    -  NaN  NaN  NaN  NaN
11   12  21.8  29.4  15.7    -    54   0  6.9  2.8   3.5    -  NaN  NaN  NaN  NaN
12   13  22.3  30.1  15.7    -    43   0  6.9  2.8   5.4    -  NaN  NaN  NaN  NaN
13   14  21.8  30.6  14.8    -    41   0  6.9  1.9   5.4    -  NaN  NaN  NaN  NaN
14   15  21.6  30.6  14.2    -    43   0  6.9  3.1   7.6    -  NaN  NaN  NaN  NaN
15   16  21.1  29.9  15.4    -    55   0  6.9  4.1   7.6    -  NaN  NaN  NaN  NaN
16   17  20.4  28.1  15.4    -    59   0  6.9    5  11.1    -  NaN  NaN  NaN  NaN
17   18  21.2  28.3  14.5    -    53   0  6.9  3.1   7.6    -  NaN  NaN  NaN  NaN
18   19  21.6  29.6  16.4    -    58   0  6.9  2.2   3.5    -  NaN  NaN  NaN  NaN
19   20  21.9  29.6  16.6    -    58   0  6.9  2.4   5.4    -  NaN  NaN  NaN  NaN
20   21  22.3  29.9  17.5    -    55   0  6.9  3.1   5.4    -  NaN  NaN  NaN  NaN
21   22  21.9  29.9  15.1    -    46   0  6.9  4.3   7.6    -  NaN  NaN  NaN  NaN
22   23  21.3    29  15.2    -    50   0  6.9  3.3   5.4    -  NaN  NaN  NaN  NaN
23   24  21.3  28.8  14.6    -    45   0  6.9    3   5.4    -  NaN  NaN  NaN  NaN
24   25  21.6  29.1  15.5    -    47   0  7.7  4.8   7.6    -  NaN  NaN  NaN  NaN
25   26  21.8  29.2  14.6    -    41   0  6.9  2.8   3.5    -  NaN  NaN  NaN  NaN
26   27  22.3  30.1  15.6    -    40   0  6.9  2.4   5.4    -  NaN  NaN  NaN  NaN
27   28  22.4  30.3    16    -    51   0  6.9  2.8   3.5    -  NaN  NaN  NaN  NaN
28   29    23  30.3  16.9    -    53   0  6.6  2.8   5.4    -  NaN  NaN  NaN    o
29   30  23.1    30  17.8    -    54   0  6.9  5.4   7.6    -  NaN  NaN  NaN  NaN
30   31  22.1  29.8  17.3    -    54   0  6.9  5.2   9.4    -  NaN  NaN  NaN  NaN
31  Monthly means and totals: (this label is repeated in all 15 columns)
32  NaN  22.3    30  16.4    -  51.6   0  6.9  3.5   6.3  NaN    0    0    0    1
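
If you want to try the substitution step without hitting the site, here is a minimal sketch that runs on a small hand-written HTML snippet. The sample markup, the class names (ntaa, ntbb, ntcc) and the helper names css_span_contents and fill_css_spans are made up for illustration. It also assumes the rules look like span.CLASS::after{content:"..."}, which may not match the real page exactly.

```python
import re

# Hypothetical sample page: the digits of "2.5" are injected via CSS ::after rules,
# so the <td> itself only contains empty placeholder spans.
SAMPLE_HTML = '''
<style>
.numspan span.ntaa::after{content:"2"}
.numspan span.ntbb::after{content:"."}
.numspan span.ntcc::after{content:"5"}
</style>
<table class="medias mensuales numspan">
<tr><th>Day</th><th>T</th></tr>
<tr><td>1</td><td><span class="ntaa"></span><span class="ntbb"></span><span class="ntcc"></span></td></tr>
<tr><td>2</td><td>23.4</td></tr>
</table>
'''


def css_span_contents(html):
    """Map each span class name to the text its ::after rule inserts."""
    pattern = r'span\.([\w-]+)::after\s*\{\s*content\s*:\s*"([^"]*)"'
    return dict(re.findall(pattern, html))


def fill_css_spans(html):
    """Replace empty placeholder spans with the text found in the CSS rules."""
    mapping = css_span_contents(html)

    def substitute(match):
        # Leave spans with unknown classes untouched
        return mapping.get(match.group(1), match.group(0))

    return re.sub(r'<span class="([\w-]+)"></span>', substitute, html)


filled = fill_css_spans(SAMPLE_HTML)
print(filled)
```

The resulting string should then be parseable the same way as in the answer, for example by passing the table markup to pd.read_html wrapped in StringIO.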
