Web scraping data from basketball-reference

Posted 2024-05-20 19:22:44


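I am trying to scrape some data from a website, but I am having trouble filtering the data down to a single set of results.

I want a DataFrame that contains all the advanced stats from the 2018-19 season.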

import requests
from bs4 import BeautifulSoup
import pandas as pd

headers = {'User-Agent': 
           'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.106 Safari/537.36'}

page = "https://www.basketball-reference.com/players/c/curryst01.html"
pageTree = requests.get(page, headers=headers)
pageSoup = BeautifulSoup(pageTree.content, 'html.parser')

dados_agrupados = pageSoup.find_all("div", {"id": "all_advanced"}, recursive=True)

print(dados_agrupados)

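As you can see, the dados_agrupados object contains the full historical data plus some other information.

How can I filter this data further to get only the stats for the 2018-19 season?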


1 answer
User
Answer #1 · posted 2024-05-20 19:22:44

To get the advanced stats table, you need to extract it from the HTML comments, which is where it lives. I am not sure what you mean by wanting "all advanced stats from the 2018-19 season."
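To illustrate what "pulling a table out of an HTML comment" means, here is a minimal offline sketch. The SAMPLE_HTML string is hand-written for illustration and only imitates the idea of a commented-out table; it is not the real page markup, and the helper name commented_table_rows is hypothetical.

```python
from bs4 import BeautifulSoup, Comment

# Hypothetical, hand-written HTML that imitates a table wrapped in a comment.
SAMPLE_HTML = """
<div id="all_advanced">
<!--
<table id="advanced">
<tr><th>Season</th><th>PER</th></tr>
<tr><td>2017-18</td><td>28.2</td></tr>
<tr><td>2018-19</td><td>24.4</td></tr>
</table>
-->
</div>
"""


def commented_table_rows(html, table_id):
    """Return the rows of a commented-out table as lists of cell strings."""
    soup = BeautifulSoup(html, "html.parser")
    # Comment nodes are a string subclass, so select them by type.
    comments = soup.find_all(string=lambda text: isinstance(text, Comment))
    for comment in comments:
        # Re-parse the comment text as HTML and look for the wanted table.
        inner = BeautifulSoup(str(comment), "html.parser")
        table = inner.find("table", id=table_id)
        if table is not None:
            return [
                [cell.get_text(strip=True) for cell in row.find_all(["th", "td"])]
                for row in table.find_all("tr")
            ]
    return []  # no comment contained a table with that id


rows = commented_table_rows(SAMPLE_HTML, "advanced")
print(rows)
```

A plain find or find_all on the parsed page does not see the table because, from the parser's point of view, the whole comment is a single text node; it only becomes a table once the comment text is parsed again.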

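There is only one table under id="all_advanced", and it has a single row for that season. If you mean that you want to go to that link and pull the table from there, that is a different matter, but your question does not make that clear.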

So, here is how to pull that table and then filter it down to that season's row:

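from io import StringIO

import requests
from bs4 import BeautifulSoup, Comment
import pandas as pd

headers = {'User-Agent': 
           'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.106 Safari/537.36'}

page = "https://www.basketball-reference.com/players/c/curryst01.html"
pageTree = requests.get(page, headers=headers)
pageSoup = BeautifulSoup(pageTree.content, 'html.parser')

# The advanced stats table sits inside an HTML comment, so collect every comment node
comments = pageSoup.find_all(string=lambda text: isinstance(text, Comment))

tables = []
for each in comments:
    if 'table' in each:
        try:
            # Wrap the text in StringIO: recent pandas versions deprecate passing literal HTML strings
            tables.append(pd.read_html(StringIO(str(each)), attrs={'id': 'advanced'})[0])
        except ValueError:
            # read_html raises ValueError when the comment has no matching table
            continue

# First (and only) table with id="advanced", then keep just the 2018-19 row
df = tables[0]
df_filter = df[df['Season'] == '2018-19']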

Output:

print (df.to_string())
     Season   Age   Tm   Lg  Pos    G     MP   PER    TS%   3PAr    FTr  ORB%  DRB%  TRB%  AST%  STL%  BLK%  TOV%  USG%  Unnamed: 19   OWS   DWS     WS  WS/48  Unnamed: 24  OBPM  DBPM   BPM  VORP
0   2009-10  21.0  GSW  NBA   PG   80   2896  16.3  0.568  0.332  0.175   1.8  12.0   6.8  24.6   2.5   0.5  16.5  21.8          NaN   3.0   1.6    4.7  0.077          NaN   1.1  -0.5   0.7   2.0
1   2010-11  22.0  GSW  NBA   PG   74   2489  19.4  0.595  0.325  0.216   2.3  10.9   6.5  28.1   2.2   0.6  16.4  24.4          NaN   5.4   1.3    6.6  0.128          NaN   3.0  -0.7   2.3   2.7
2   2011-12  23.0  GSW  NBA   PG   26    732  21.2  0.605  0.409  0.159   2.3  11.3   6.8  32.3   2.8   0.8  17.0  24.0          NaN   1.8   0.4    2.2  0.144          NaN   4.1   0.3   4.3   1.2
3   2012-13  24.0  GSW  NBA   PG   78   2983  21.3  0.589  0.432  0.210   2.3   9.1   5.8  31.1   2.1   0.3  13.7  26.4          NaN   8.4   2.8   11.2  0.180          NaN   5.3   0.1   5.4   5.6
4   2013-14  25.0  GSW  NBA   PG   78   2846  24.1  0.610  0.445  0.252   1.8  10.9   6.4  39.9   2.2   0.4  16.1  28.3          NaN   9.3   4.0   13.4  0.225          NaN   6.3   1.1   7.4   6.7
5   2014-15  26.0  GSW  NBA   PG   80   2613  28.0  0.638  0.482  0.251   2.4  11.4   7.0  38.6   3.0   0.5  14.3  28.9          NaN  11.5   4.1   15.7  0.288          NaN   8.2   1.7   9.9   7.9
6   2015-16  27.0  GSW  NBA   PG   79   2700  31.5  0.669  0.554  0.250   2.9  13.6   8.6  33.7   3.0   0.4  12.9  32.6          NaN  13.8   4.1   17.9  0.318          NaN  10.3   1.6  11.9   9.5
7   2016-17  28.0  GSW  NBA   PG   79   2638  24.6  0.624  0.547  0.251   2.7  11.4   7.3  31.2   2.6   0.5  13.0  30.1          NaN   8.7   3.9   12.6  0.229          NaN   6.7   0.3   6.9   5.9
8   2017-18  29.0  GSW  NBA   PG   51   1631  28.2  0.675  0.580  0.350   2.7  14.4   9.0  30.3   2.4   0.4  13.3  31.0          NaN   7.2   1.9    9.1  0.267          NaN   7.8   0.0   7.7   4.0
9   2018-19  30.0  GSW  NBA   PG   69   2331  24.4  0.641  0.604  0.214   2.2  14.2   8.4  24.2   1.9   0.9  11.6  30.4          NaN   7.2   2.5    9.7  0.199          NaN   7.1  -0.5   6.6   5.1
10  2019-20  31.0  GSW  NBA   PG    5    139  21.7  0.557  0.598  0.317   3.0  17.8  10.1  42.3   1.7   1.3  14.6  33.6          NaN   0.2   0.1    0.3  0.104          NaN   4.5  -0.6   3.9   0.2
11   Career   NaN  NaN  NBA  NaN  699  23998  23.8  0.623  0.481  0.237   2.3  11.8   7.2  31.5   2.5   0.5  14.2  27.9          NaN  76.5  26.7  103.2  0.207          NaN   6.0   0.4   6.4  50.7

And the filtered result:

print (df_filter.to_string())
    Season   Age   Tm   Lg Pos   G    MP   PER    TS%   3PAr    FTr  ORB%  DRB%  TRB%  AST%  STL%  BLK%  TOV%  USG%  Unnamed: 19  OWS  DWS   WS  WS/48  Unnamed: 24  OBPM  DBPM  BPM  VORP
9  2018-19  30.0  GSW  NBA  PG  69  2331  24.4  0.641  0.604  0.214   2.2  14.2   8.4  24.2   1.9   0.9  11.6  30.4          NaN  7.2  2.5  9.7  0.199          NaN   7.1  -0.5  6.6   5.1
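The output above still carries the empty "Unnamed: ..." spacer columns. As an optional tidy-up, here is a small sketch that drops all-NaN columns before filtering. It builds a tiny DataFrame by hand from a few of the values printed above instead of hitting the website, so the column subset is an assumption made for brevity.

```python
import pandas as pd

# A few columns and rows copied from the printed output; the spacer column is all NaN.
df = pd.DataFrame(
    {
        "Season": ["2017-18", "2018-19", "Career"],
        "PER": [28.2, 24.4, 23.8],
        "Unnamed: 19": [float("nan")] * 3,
        "WS": [9.1, 9.7, 103.2],
    }
)

# Drop every column that contains only NaN (the "Unnamed: ..." spacers).
clean = df.dropna(axis="columns", how="all")

# A boolean mask on the Season column keeps just the 2018-19 row.
season_2018_19 = clean[clean["Season"] == "2018-19"]
print(season_2018_19.to_string())
```

The same two lines (dropna, then the mask) should work unchanged on the full df from the answer, since the mask only depends on the Season column.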
