ValueError: No tables found matching regex '.+' at a random point while scraping a large amount of data


This is my first project with pandas and selenium, so I may be making a silly mistake. I wrote this function to go through a list of NBA players and scrape each player's game log into a DataFrame. It all works fine, but occasionally, at some random point as I work through the list of players, it stops working and gives me this error:

Traceback (most recent call last):
  File "/Users/arslanamir/PycharmProjects/nba/main.py", line 154, in <module>
    Game_Log_Scraper(players, x)
  File "/Users/arslanamir/PycharmProjects/nba/main.py", line 48, in Game_Log_Scraper
    tables = pd.read_html(html, flavor='lxml')
  File "/Users/arslanamir/PycharmProjects/nba/venv/lib/python3.9/site-packages/pandas/util/_decorators.py", line 299, in wrapper
    return func(*args, **kwargs)
  File "/Users/arslanamir/PycharmProjects/nba/venv/lib/python3.9/site-packages/pandas/io/html.py", line 1085, in read_html
    return _parse(
  File "/Users/arslanamir/PycharmProjects/nba/venv/lib/python3.9/site-packages/pandas/io/html.py", line 913, in _parse
    raise retained
  File "/Users/arslanamir/PycharmProjects/nba/venv/lib/python3.9/site-packages/pandas/io/html.py", line 893, in _parse
    tables = p.parse_tables()
  File "/Users/arslanamir/PycharmProjects/nba/venv/lib/python3.9/site-packages/pandas/io/html.py", line 213, in parse_tables
    tables = self._parse_tables(self._build_doc(), self.match, self.attrs)
  File "/Users/arslanamir/PycharmProjects/nba/venv/lib/python3.9/site-packages/pandas/io/html.py", line 684, in _parse_tables
    raise ValueError(f"No tables found matching regex {repr(pattern)}")
ValueError: No tables found matching regex '.+'

Process finished with exit code 1

Here is the function:

def Game_Log_Scraper(players):
    for name in players:
        first = name.split()[0]
        last = name.split()[1]
        if not Path(f'/Users/arslanamir/PycharmProjects/nba/{first} {last}').is_file():
            driver = webdriver.Chrome(executable_path='/Users/arslanamir/PycharmProjects/chromedriver')
            driver.get(f'https://www.nba.com/stats/players/boxscores/?CF=PLAYER_NAME*E*{first}%20{last}&Season=2020-21'
                       f'&SeasonType=Regular%20Season')
            html = driver.page_source

            tables = pd.read_html(html, flavor='lxml')
            data = tables[1]

            driver.close()

            not_needed = ['Match\xa0Up', 'Season', 'FGM', 'FGA', '3PM', '3PA', '3P%', 'FTM', 'FTA',
                          'FT%', 'STL', 'BLK', 'TOV', '+/-', 'FP', 'FG%', 'OREB', 'DREB', 'PF']

            for item in not_needed:
                data.drop(item, axis=1, inplace=True)

            data.dropna(axis=0, inplace=True)
            data.drop('W/L', axis=1, inplace=True)

            with open(f'{first} {last}', 'w+') as f:
                f.write(data.to_string())

    return players

I have also tried changing the read_html flavor to html5lib and bs4, but neither works. Here is an example of the web page: https://www.nba.com/stats/players/boxscores/?CF=PLAYER_NAME*E*Malik%20Beasley&Season=2020-21&SeasonType=Regular%20Season
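
For what it's worth, an intermittent failure like this usually means page_source was captured before the table finished rendering. A minimal sketch of an explicit wait (assuming the same Selenium 3 setup as in the question, and a generic table selector):

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome(executable_path='/Users/arslanamir/PycharmProjects/chromedriver')
driver.get('https://www.nba.com/stats/players/boxscores/?CF=PLAYER_NAME*E*Malik%20Beasley'
           '&Season=2020-21&SeasonType=Regular%20Season')

# Block (up to 15 seconds) until at least one <table> is attached to the
# DOM, instead of reading page_source immediately after driver.get()
WebDriverWait(driver, 15).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, 'table')))
html = driver.page_source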


1 Answer

A few things, right off the bat:

  1. You don't need to loop over the columns to drop them one at a time; you can pass the whole list at once.

So change

for item in not_needed:
    data.drop(item, axis=1, inplace=True)

to

data.drop(not_needed, axis=1, inplace=True)
  2. You aren't doing anything with the players list inside the function, so there's really no need to return it, or anything else. All the function does is check whether a file already exists and write it if it doesn't.

  3. Selenium is overkill here (it slows you down by forcing everything through a browser). The nba stats api can fetch all the season and player data in a single request. Then you just filter that table, instead of filtering through the browser.

  4. To filter that table/data from the api, the player name you provide has to match exactly what's in the data, and it's case sensitive. So to account for typos and for names in your players list that differ from the data (e.g. 'Glenn Robinson' would return nothing from the table, because it's stored as 'Glenn Robinson III'), I added an extra step using a package called fuzzywuzzy (a quick illustration follows this list). Make sure to pip install fuzzywuzzy for it to work.

  5. I didn't do much more with your code, but keep in mind that if you need your files updated (so if you run this today and then again next week), the new games won't be in your files, because you're only checking whether the file exists, not whether it's up to date. A sketch of a possible freshness check also follows this list.
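
For illustration, here's roughly what the fuzzy match in point 4 does. The candidate list below is made up for the example; in the real code it's data['PLAYER_NAME'].unique():

from fuzzywuzzy import process

# Stand-in candidates for data['PLAYER_NAME'].unique()
choices = ['Glenn Robinson III', 'Zach LaVine', 'LeBron James']

# extractOne returns a (best_match, similarity_score) tuple, so an
# incomplete or misspelled query still resolves to the right player
print(process.extractOne('Glenn Robinson', choices))  # ('Glenn Robinson III', <high score>)
print(process.extractOne('ZaCk LeViNE', choices))     # ('Zach LaVine', <high score>)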
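
And a minimal sketch of the freshness check from point 5, assuming a 24-hour staleness threshold (the helper name and threshold are my own, not part of the code below):

import time
from pathlib import Path

def needs_refresh(path, max_age_hours=24):
    # True if the file is missing or was last modified more than max_age_hours ago
    if not path.is_file():
        return True
    return time.time() - path.stat().st_mtime > max_age_hours * 3600

# e.g. replace the plain is_file() check with:
# if needs_refresh(Path(f'/Users/arslanamir/PycharmProjects/nba/{player}.csv')):
#     ... re-fetch and rewrite the file ...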

Code:

import requests
import pandas as pd
from pathlib import Path

# pip install fuzzywuzzy
from fuzzywuzzy import process

def get_data():
    # One request to the NBA stats API returns every player box score for the season
    url = 'https://stats.nba.com/stats/leaguegamelog'
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
                      '(KHTML, like Gecko) Chrome/88.0.4324.150 Safari/537.36',
        'Referer': 'http://stats.nba.com'}

    payload = {
        'Counter': '1000',
        'DateFrom': '',
        'DateTo': '',
        'Direction': 'DESC',
        'LeagueID': '00',
        'PlayerOrTeam': 'P',  # 'P' = player-level rows ('T' would be team-level)
        'Season': '2020-21',
        'SeasonType': 'Regular Season',
        'Sorter': 'DATE'}

    jsonData = requests.get(url, headers=headers, params=payload).json()

    # Column names and rows come back as separate lists in the JSON
    cols = jsonData['resultSets'][0]['headers']
    data = jsonData['resultSets'][0]['rowSet']
    df = pd.DataFrame(data, columns=cols)
    return df



def Game_Log_Scraper(players):
    data = get_data()
    for name in players:

        # Use fuzzywuzzy to match the given name to the closest player name in the data
        choices = list(data['PLAYER_NAME'].unique())
        player = process.extractOne(name, choices)[0]

        if not Path(f'/Users/arslanamir/PycharmProjects/nba/{player}.csv').is_file():
            # .copy() so the drops below don't operate on a slice of the full table
            player_df = data[data['PLAYER_NAME'] == player].copy()

            not_needed = ['MATCHUP', 'SEASON_ID', 'FGM', 'FGA', 'FG3M', 'FG3A',
                          'FG3_PCT', 'FTM', 'FTA', 'WL', 'FT_PCT', 'STL', 'BLK',
                          'TOV', 'PLUS_MINUS', 'FANTASY_PTS', 'FG_PCT', 'OREB',
                          'DREB', 'PF', 'VIDEO_AVAILABLE']

            player_df.drop(not_needed, axis=1, inplace=True)
            player_df.dropna(axis=0, inplace=True)

            player_df.to_csv(f'/Users/arslanamir/PycharmProjects/nba/{player}.csv', index=False)
            print(f'{player} file saved.')

        else:
            print(f'{player} file already present.')


players = ['Zach LaVine', 'ZaCk LeViNE', 'LeBron James', 'Labron james Jr.', 'le brn jame Jr.']
Game_Log_Scraper(players)

Output:

Zach LaVine file saved.
Zach LaVine file already present.
LeBron James file saved.
LeBron James file already present.
LeBron James file already present.

LeBron James.csv (screenshot of the saved file)
