Scraping home-team box scores

Posted 2024-09-28 01:25:25

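I'm working on a project in which I want to scrape NBA game statistics for the 2019/20 season, from October through August.

I only care about the game results for the home and away teams, not about detailed player or team statistics, so I need the box score data for each game from the "Basic Box Score Stats" tables.

Problem: when scraping the box scores, I only manage to collect the away team's data, because its table is the first one on the box score page and I can simply select it with index [0] (it is static). For the home team, the table index seems to change depending on whether the game went to overtime (OT), and sometimes because of other, unspecified variations (it is somewhat dynamic).

Question: what is the best way to use a loop to collect the box scores of both the away and the home team for every month? Alternatively, how can I collect the box score data for the home team in each game?

Example box score page for a game without overtime: https://www.basketball-reference.com/boxscores/201910220LAC.html

Example box score page for a game with overtime: https://www.basketball-reference.com/boxscores/201910220TOR.html

In the latter example, the home team's table index depends on how many tables come before it (tables that contain data, such as the overtime tables). Without overtime it is usually the eighth table; with overtime the index is different.

The code that successfully (and consistently) gets the away team's data is as follows: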

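import pandas as pd

# Keep the URLs in a list: looping over a bare string would iterate its characters
box_score_urls = ['https://www.basketball-reference.com/boxscores/201910230POR.html']
dfbox = []
for eachBox in box_score_urls:
    dfz = pd.read_html(eachBox)
    dfbox.append(dfz[0])  # index 0 is the away team's table

boxbox_awayteam = pd.concat(dfbox)
boxbox_awayteam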

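I'm out of ideas, because none of the tables in the HTML seems to have a specific id or class. This is my first web scraping project and my first question on Stack Overflow, so I'm very new to all of this.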


1 answer
Answer 1 · Posted 2024-09-28 01:25:25

You can use BeautifulSoup and the CSS selector [id$="-game-basic"] table to select only the two basic tables, and then load them with pd.read_html():

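import requests
import pandas as pd
from io import StringIO
from bs4 import BeautifulSoup


url = 'https://www.basketball-reference.com/boxscores/201910220TOR.html'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')

# Tables inside elements whose id ends with "-game-basic":
# the two full-game basic box scores, whether or not there was overtime
my_tables = soup.select('[id$="-game-basic"] table')

# read_html returns a list; take its only table and drop the top header level.
# StringIO avoids the deprecation warning newer pandas emits for literal HTML strings.
df_1 = pd.read_html(StringIO(str(my_tables[0])))[0].droplevel(0, axis=1)
df_2 = pd.read_html(StringIO(str(my_tables[1])))[0].droplevel(0, axis=1)

print(df_1)
print(df_2)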

Prints:

                    Starters            MP  ...           PTS           +/-
0               Jrue Holiday         41:05  ...            13           -14
1             Brandon Ingram         35:06  ...            22           -19
2                J.J. Redick         27:03  ...            16           -14
3                 Lonzo Ball         24:50  ...             8            -7
4             Derrick Favors         20:46  ...             6           -12
5                   Reserves            MP  ...           PTS           +/-
6                  Josh Hart         28:10  ...            15            -1
7               Nicolò Melli         19:37  ...            14           +11
8           Kenrich Williams         18:02  ...             3           +11
9              Frank Jackson         13:51  ...             9            +7
10             Jahlil Okafor         12:29  ...             8            -7
11             E'Twaun Moore         12:06  ...             5            -1
12  Nickeil Alexander-Walker         11:55  ...             3            +6
13              Jaxson Hayes  Did Not Play  ...  Did Not Play  Did Not Play
14               Team Totals           265  ...           122           NaN

[15 rows x 21 columns]
           Starters            MP  ...           PTS           +/-
0        Kyle Lowry         44:59  ...            22            -1
1     Fred VanVleet         44:21  ...            34           +18
2     Pascal Siakam         38:09  ...            34            +5
3        OG Anunoby         35:48  ...            11           +12
4        Marc Gasol         31:55  ...             6            -2
5          Reserves            MP  ...           PTS           +/-
6     Norman Powell         28:38  ...             5            +2
7       Serge Ibaka         26:00  ...            13            +6
8     Terence Davis         15:10  ...             5             0
9       Matt Thomas  Did Not Play  ...  Did Not Play  Did Not Play
10    Chris Boucher  Did Not Play  ...  Did Not Play  Did Not Play
11  Stanley Johnson  Did Not Play  ...  Did Not Play  Did Not Play
12   Malcolm Miller  Did Not Play  ...  Did Not Play  Did Not Play
13  Dewan Hernandez  Did Not Play  ...  Did Not Play  Did Not Play
14      Team Totals           265  ...           130           NaN

[15 rows x 21 columns]
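
To see why this selector does not depend on the number of overtime tables, here is a minimal offline sketch. The HTML is a hand-made stand-in for a box score page, not the site's real markup, and the wrapper ids (div_box-NOP-q1-basic and so on) are assumptions modelled on the -game-basic suffix used above. [id$="-game-basic"] matches any element whose id ends with that suffix, so per-quarter and overtime tables are skipped however many of them come before the home team's table.

```python
from bs4 import BeautifulSoup

# Hand-made stand-in for a box score page (assumed ids, not the real markup).
html = """
<div id="div_box-NOP-game-basic"><table id="box-NOP-game-basic"></table></div>
<div id="div_box-NOP-q1-basic"><table id="box-NOP-q1-basic"></table></div>
<div id="div_box-NOP-ot1-basic"><table id="box-NOP-ot1-basic"></table></div>
<div id="div_box-TOR-game-basic"><table id="box-TOR-game-basic"></table></div>
<div id="div_box-TOR-ot1-basic"><table id="box-TOR-ot1-basic"></table></div>
"""

soup = BeautifulSoup(html, "html.parser")

# Descendant selector: every <table> inside an element whose id ends with "-game-basic"
selected_ids = [table["id"] for table in soup.select('[id$="-game-basic"] table')]
print(selected_ids)
```

Only the two full-game tables are selected, in document order, so index 0 and index 1 stay stable from game to game.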

Edit: To put this into a loop, you can use the following example:

import requests
import pandas as pd
from io import StringIO
from bs4 import BeautifulSoup

# Season schedule page; it links to one schedule page per month
url = 'https://www.basketball-reference.com/leagues/NBA_2020_games.html'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')

def get_tables(url):
    # Return the two full-game basic box score tables of one game
    soup = BeautifulSoup(requests.get(url).content, 'html.parser')

    my_tables = soup.select('[id$="-game-basic"] table')

    # StringIO avoids the deprecation warning newer pandas emits for literal HTML strings
    df_1 = pd.read_html(StringIO(str(my_tables[0])))[0].droplevel(0, axis=1)
    df_2 = pd.read_html(StringIO(str(my_tables[1])))[0].droplevel(0, axis=1)

    return df_1, df_2

# Outer loop: one schedule page per month
for a in soup.select('.filter a'):
    u = 'https://www.basketball-reference.com' + a['href']
    print(u)
    soup2 = BeautifulSoup(requests.get(u).content, 'html.parser')
    # Inner loop: every box score link on that month's schedule
    for a2 in soup2.select('td a[href^="/boxscores/"]'):
        u2 = 'https://www.basketball-reference.com' + a2['href']
        t1, t2 = get_tables(u2)
        print(u2)
        print(t1)
        print(t2)
        print('-' * 80)

Prints:

https://www.basketball-reference.com/leagues/NBA_2020_games-october.html
https://www.basketball-reference.com/boxscores/201910220TOR.html
                    Starters            MP  ...           PTS           +/-
0               Jrue Holiday         41:05  ...            13           -14
1             Brandon Ingram         35:06  ...            22           -19
2                J.J. Redick         27:03  ...            16           -14
3                 Lonzo Ball         24:50  ...             8            -7
4             Derrick Favors         20:46  ...             6           -12
5                   Reserves            MP  ...           PTS           +/-
6                  Josh Hart         28:10  ...            15            -1
7               Nicolò Melli         19:37  ...            14           +11
8           Kenrich Williams         18:02  ...             3           +11
9              Frank Jackson         13:51  ...             9            +7
10             Jahlil Okafor         12:29  ...             8            -7
11             E'Twaun Moore         12:06  ...             5            -1
12  Nickeil Alexander-Walker         11:55  ...             3            +6
13              Jaxson Hayes  Did Not Play  ...  Did Not Play  Did Not Play
14               Team Totals           265  ...           122           NaN

[15 rows x 21 columns]
           Starters            MP  ...           PTS           +/-
0        Kyle Lowry         44:59  ...            22            -1
1     Fred VanVleet         44:21  ...            34           +18
2     Pascal Siakam         38:09  ...            34            +5
3        OG Anunoby         35:48  ...            11           +12
4        Marc Gasol         31:55  ...             6            -2
5          Reserves            MP  ...           PTS           +/-
6     Norman Powell         28:38  ...             5            +2
7       Serge Ibaka         26:00  ...            13            +6
8     Terence Davis         15:10  ...             5             0
9       Matt Thomas  Did Not Play  ...  Did Not Play  Did Not Play
10    Chris Boucher  Did Not Play  ...  Did Not Play  Did Not Play
11  Stanley Johnson  Did Not Play  ...  Did Not Play  Did Not Play
12   Malcolm Miller  Did Not Play  ...  Did Not Play  Did Not Play
13  Dewan Hernandez  Did Not Play  ...  Did Not Play  Did Not Play
14      Team Totals           265  ...           130           NaN

[15 rows x 21 columns]
--------------------------------------------------------------------------------
https://www.basketball-reference.com/boxscores/201910220LAC.html
                    Starters            MP  ...           PTS           +/-
0              Anthony Davis         37:22  ...            25            +3
1               LeBron James         36:00  ...            18            -8
2                Danny Green         32:20  ...            28            +7


...and so on.
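
Since the question is only after game results rather than player lines, each pair of tables could be reduced to one row per game. The following is a rough sketch and not part of the original answer: parse_boxscore_url and team_points are hypothetical helper names. It assumes the box score URL follows the pattern seen above (an eight-digit date, a 0, then the home team's three-letter code) and that each table has a Team Totals row under Starters with the score under PTS, as in the printed output. The two small frames stand in for the scraped tables, away team first as in that output.

```python
import re
import pandas as pd

# Assumed URL pattern: /boxscores/<YYYYMMDD>0<HOME>.html
BOX_URL = re.compile(r"/boxscores/(\d{4})(\d{2})(\d{2})0([A-Z]{3})\.html$")

def parse_boxscore_url(url):
    """Return (ISO date, home team code) taken from a box score URL."""
    match = BOX_URL.search(url)
    if match is None:
        raise ValueError(f"unexpected box score URL: {url}")
    year, month, day, home = match.groups()
    return f"{year}-{month}-{day}", home

def team_points(df):
    """Return the PTS value of the 'Team Totals' row as an int."""
    totals = df.loc[df["Starters"] == "Team Totals", "PTS"]
    return int(totals.iloc[0])

# Stand-ins for the two scraped tables (values copied from the output above).
away = pd.DataFrame({"Starters": ["Jrue Holiday", "Team Totals"], "PTS": ["13", "122"]})
home = pd.DataFrame({"Starters": ["Kyle Lowry", "Team Totals"], "PTS": ["22", "130"]})

date, home_team = parse_boxscore_url(
    "https://www.basketball-reference.com/boxscores/201910220TOR.html"
)
result = {
    "date": date,
    "home": home_team,
    "away_pts": team_points(away),
    "home_pts": team_points(home),
}
print(result)
```

Inside the loop, t1 and t2 returned by get_tables(u2) would take the place of away and home, and the resulting dictionaries could be collected in a list and passed to pd.DataFrame at the end.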
