I am scraping multiple seasons of National Hockey League (NHL) skater data from URLs of the form:
https://www.hockey-reference.com/leagues/NHL_2018_skaters.html
I am only getting a handful of rows back, and I have tried moving the dict statement around inside the for loop. I have also tried solutions I found in other posts, with no luck. Any help is appreciated. Thanks, all!
import requests
from bs4 import BeautifulSoup
import pandas as pd

dict = {}
for i in range(2010, 2020):
    year = str(i)
    source = requests.get('https://www.hockey-reference.com/leagues/NHL_' + year + '_skaters.html').text
    soup = BeautifulSoup(source, features='lxml')
    # identifying table in html
    table = soup.find('table', id="stats")
    # grabbing <tr> tags in html
    rows = table.findAll("tr")
    # creating passable values for each "stat" in td tag
    data_stats = [
        "player",
        "age",
        "team_id",
        "pos",
        "games_played",
        "goals",
        "assists",
        "points",
        "plus_minus",
        "pen_min",
        "ps",
        "goals_ev",
        "goals_pp",
        "goals_sh",
        "goals_gw",
        "assists_ev",
        "assists_pp",
        "assists_sh",
        "shots",
        "shot_pct",
        "time_on_ice",
        "time_on_ice_avg",
        "blocks",
        "hits",
        "faceoff_wins",
        "faceoff_losses",
        "faceoff_percentage"
    ]
    for rownum in rows:
        # grabbing player name and using as key
        filter = {"data-stat": 'player'}
        cell = rows[3].findAll("td", filter)
        nameval = cell[0].string
        list = []
        for data in data_stats:
            # iterating through data_stats to grab values
            filter = {"data-stat": data}
            cell = rows[3].findAll("td", filter)
            value = cell[0].string
            list.append(value)
        dict[nameval] = list
        dict[nameval].append(year)
# conversion to numeric values and creating dataframe
columns = [
    "player",
    "age",
    "team_id",
    "pos",
    "games_played",
    "goals",
    "assists",
    "points",
    "plus_minus",
    "pen_min",
    "ps",
    "goals_ev",
    "goals_pp",
    "goals_sh",
    "goals_gw",
    "assists_ev",
    "assists_pp",
    "assists_sh",
    "shots",
    "shot_pct",
    "time_on_ice",
    "time_on_ice_avg",
    "blocks",
    "hits",
    "faceoff_wins",
    "faceoff_losses",
    "faceoff_percentage",
    "year"
]
df = pd.DataFrame.from_dict(dict, orient='index', columns=columns)
cols = df.columns.drop(['player', 'team_id', 'pos', 'year'])
df[cols] = df[cols].apply(pd.to_numeric, errors='coerce')
print(df)
Output:
Craig Adams Craig Adams 32 ... 43.9 2010
Luke Adam Luke Adam 22 ... 100.0 2013
Justin Abdelkader Justin Abdelkader 29 ... 29.4 2017
Will Acton Will Acton 27 ... 50.0 2015
Noel Acciari Noel Acciari 24 ... 44.1 2016
Pontus Aberg Pontus Aberg 25 ... 10.5 2019
[6 rows x 28 columns]
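A note on the symptom itself: the inner loop indexes `rows[3]` on every pass instead of using the loop variable `rownum`, so each season scrapes the same single row and stores only one dictionary entry. Below is a minimal sketch of a corrected per-row loop, run here against a made-up inline table that mimics the site's `data-stat` markup (the players and stats are invented, not real data):

```python
from bs4 import BeautifulSoup

# A tiny stand-in for the hockey-reference table (same structure, made-up rows).
html = """
<table id="stats">
  <tr><th data-stat="player">Player</th><th data-stat="goals">G</th></tr>
  <tr><td data-stat="player">Alice Example</td><td data-stat="goals">10</td></tr>
  <tr><td data-stat="player">Bob Example</td><td data-stat="goals">7</td></tr>
</table>
"""

soup = BeautifulSoup(html, features="html.parser")
rows = soup.find("table", id="stats").findAll("tr")
data_stats = ["player", "goals"]

players = {}  # avoid shadowing the built-in name `dict`
for rownum in rows:
    # use the current row, NOT rows[3], so every player is actually visited
    cell = rownum.findAll("td", {"data-stat": "player"})
    if not cell:  # header rows contain <th> rather than <td> -- skip them
        continue
    nameval = str(cell[0].string)
    stats = []  # avoid shadowing the built-in name `list`
    for stat in data_stats:
        cell = rownum.findAll("td", {"data-stat": stat})
        stats.append(str(cell[0].string) if cell else None)
    players[nameval] = stats

print(players)
# {'Alice Example': ['Alice Example', '10'], 'Bob Example': ['Bob Example', '7']}
```

The `if not cell: continue` guard also matters on the real pages, which repeat the header row partway through the table.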
Scraping tables wrapped in heavy markup with Beautiful Soup is definitely painful (no knock on Beautiful Soup -- it is wonderful for several use cases). If you are willing to be a little pragmatic about it, I would use a "hack" to grab the data surrounded by dense markup:
It isn't perfect -- there will be blank rows in the CSV -- but removing blank rows is far easier and less time-consuming than scraping data like this. Good luck!
For reference, I've linked my own conversion below.
I would just use pandas .read_html(), which does most of the table-parsing work for you (it uses BeautifulSoup under the hood). Code:
You can then save it to a file with result.to_csv('filename.csv', index=False). Output:
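As a rough sketch of what the read_html approach might look like (the variable name `result` matches the to_csv call above; the example parses an inline HTML string with invented rows so it runs offline -- for the real pages you would pass the URL instead):

```python
import io
import pandas as pd

# Stand-in for one season's page; hockey-reference serves a <table id="stats">.
html = """
<table id="stats">
  <thead><tr><th>Player</th><th>Age</th><th>G</th></tr></thead>
  <tbody>
    <tr><td>Alice Example</td><td>25</td><td>10</td></tr>
    <tr><td>Bob Example</td><td>31</td><td>7</td></tr>
  </tbody>
</table>
"""

# Against the live site you could do:
#   tables = pd.read_html('https://www.hockey-reference.com/leagues/NHL_2018_skaters.html')
tables = pd.read_html(io.StringIO(html))  # returns a list of DataFrames, one per <table>
result = tables[0]
print(result)
```

read_html infers the header from the `<thead>` and converts numeric columns automatically, which replaces both the manual `data-stat` loop and the `pd.to_numeric` pass in the question.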