The purpose of this package is to allow people to scrape raw National Hockey League (NHL) and National Women's Hockey League (NWHL) data from their respective APIs and websites.
Hockey Scraper
Purpose
This package is designed to allow people to scrape both NHL and NWHL data. For the NHL, one can scrape the play by play and shift data from the NHL API and website for all preseason, regular season, and playoff games since the 2007-2008 season. For the NWHL, one can scrape the play by play data from their API and website for all preseason, regular season, and playoff games since the 2015-2016 season.
Prerequisites
You need to have Python installed for this to work. It should work for both Python 2.7 and 3 (I recommend at least version 3.6.0, but earlier versions should be fine).
If you don't have Python installed on your machine, I suggest installing it through the Anaconda distribution. Anaconda comes with a bunch of libraries pre-installed, which makes it easier to get started.
Installation
To install, simply open up a terminal and type:
pip install hockey_scraper
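If you want a quick sanity check that the install worked, you can try importing the package (this snippet is only an illustration and not part of the package itself):

# Illustrative check only: if the import succeeds, the package is installed
import hockey_scraper
print("hockey_scraper is installed")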
NHL Usage
Standard Scraping Functions
Scrape data on a season by season level:
import hockey_scraper

# Scrapes the 2015 & 2016 season with shifts and stores the data in a Csv file
hockey_scraper.scrape_seasons([2015, 2016], True)

# Scrapes the 2008 season without shifts and returns a dictionary containing the pbp Pandas DataFrame
scraped_data = hockey_scraper.scrape_seasons([2008], False, data_format='Pandas')
Scrape a list of games:
import hockey_scraper

# Scrapes the first game of the 2014, 2015, and 2016 seasons with shifts and stores the data in a Csv file
hockey_scraper.scrape_games([2014020001, 2015020001, 2016020001], True)

# Scrapes the first game of the 2007, 2008, and 2009 seasons with shifts and returns a Dictionary with the Pandas DataFrames
scraped_data = hockey_scraper.scrape_games([2007020001, 2008020001, 2009020001], True, data_format='Pandas')
Scrape all games in a given date range:
import hockey_scraper

# Scrapes all games between 2016-10-10 and 2016-10-20 without shifts and stores the data in a Csv file
hockey_scraper.scrape_date_range('2016-10-10', '2016-10-20', False)

# Scrapes all games between 2015-1-1 and 2015-1-15 without shifts and returns a Dictionary with the pbp Pandas DataFrame
scraped_data = hockey_scraper.scrape_date_range('2015-1-1', '2015-1-15', False, data_format='Pandas')
The dictionary returned by setting the keyword argument 'data_format' to 'Pandas' is structured like this:
{
    # Both of these are always included
    'pbp': pbp_df,
    'errors': scraping_errors,

    # This is only included when the argument 'if_scrape_shifts' is set equal to True
    'shifts': shifts_df
}
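As a concrete example, here is a minimal sketch of pulling the individual pieces out of that dictionary; the season and output filename below are only illustrative:

import hockey_scraper

# Scrape one season with shifts and keep everything in memory as Pandas DataFrames
scraped_data = hockey_scraper.scrape_seasons([2016], True, data_format='Pandas')

# 'pbp' and 'errors' are always included
pbp_df = scraped_data['pbp']
print(scraped_data['errors'])

# 'shifts' is only included because if_scrape_shifts was set to True
shifts_df = scraped_data['shifts']

# Illustrative only: write the play by play wherever you like
pbp_df.to_csv('pbp_2016.csv', index=False)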
If wanted, the scraped files can also be saved in a separate directory. This allows games to be re-scraped much faster since they don't need to be retrieved again. This is done by specifying the keyword argument 'docs_dir' as True, which creates (if needed), stores to, and reads from a directory in the home directory. Alternatively, you can provide your own directory to store them in (it must already exist).
import hockey_scraper

# Create or try to refer to a directory in the home directory
# Will create a directory called 'hockey_scraper_data' in the home directory (if it doesn't exist)
hockey_scraper.scrape_seasons([2015, 2016], True, docs_dir=True)

# Path to the given directory
USER_PATH = "/...."

# Scrapes the 2015 & 2016 season with shifts and stores the data in a Csv file
# Also includes a path for an existing directory for the scraped files to be placed in or retrieved from.
hockey_scraper.scrape_seasons([2015, 2016], True, docs_dir=USER_PATH)

# One could choose to re-scrape previously saved files by making the keyword argument rescrape=True
hockey_scraper.scrape_seasons([2015, 2016], True, docs_dir=USER_PATH, rescrape=True)
Live Scraping
Here is a simple example of one way to set up live scraping. I strongly suggest checking out this section of the documentation if you plan on using this.
import hockey_scraper as hs

def to_csv(game):
    """
    Store each game DataFrame in a file

    :param game: LiveGame object

    :return: None
    """
    # If the game:
    # 1. Started - We recorded at least one event
    # 2. Not in Intermission
    # 3. Not Over
    if game.is_ongoing():
        # Get both DataFrames
        pbp_df = game.get_pbp()
        shifts_df = game.get_shifts()

        # Print the description of the last event
        print(game.game_id, "->", pbp_df.iloc[-1]['Description'])

        # Store in CSV files
        pbp_df.to_csv(f"../hockey_scraper_data/{game.game_id}_pbp.csv", sep=',')
        shifts_df.to_csv(f"../hockey_scraper_data/{game.game_id}_shifts.csv", sep=',')

if __name__ == "__main__":
    # B4 we start set the directory to store the files
    # You don't have to do this but I recommend it
    hs.live_scrape.set_docs_dir("../hockey_scraper_data")

    # Scrape the info for all the games on 2018-11-15
    games = hs.ScrapeLiveGames("2018-11-15", if_scrape_shifts=True, pause=20)

    # While all the games aren't finished
    while not games.finished():
        # Update for all the games currently being played
        games.update_live_games(sleep_next=True)

        # Go through every LiveGame object and apply some function
        # You can of course do whatever you want here.
        for game in games.live_games:
            to_csv(game)
NWHL Usage
Scrape data on a season by season level:
import hockey_scraper

# Scrapes the 2015 & 2016 season and stores the data in a Csv file
hockey_scraper.nwhl.scrape_seasons([2015, 2016])

# Scrapes the 2017 season and returns a Pandas DataFrame containing the pbp
scraped_data = hockey_scraper.nwhl.scrape_seasons([2017], data_format='Pandas')
Scrape a list of games:
import hockey_scraper

# Scrape some games and store the results in a Csv file
# Also saves the scraped pages
hockey_scraper.nwhl.scrape_games([14694271, 14814946, 14689491], docs_dir="...Path you specified")
Scrape all games in a given date range:
import hockey_scraper

# Scrapes all games between 2016-10-10 and 2017-01-01 and returns a Pandas DataFrame containing the pbp
hockey_scraper.nwhl.scrape_date_range('2016-10-10', '2017-01-01', data_format='pandas')
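Since the call above returns the play by play directly as a Pandas DataFrame when data_format='pandas', you can work with it like any other DataFrame. A minimal sketch (the output filename is only illustrative):

import hockey_scraper

# Scrapes all games in the date range and returns the pbp as a Pandas DataFrame
nwhl_pbp = hockey_scraper.nwhl.scrape_date_range('2016-10-10', '2017-01-01', data_format='pandas')

# Inspect the first few events and save them wherever you like
print(nwhl_pbp.head())
nwhl_pbp.to_csv('nwhl_pbp.csv', index=False)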
The full documentation can be found here.
Contact
Please contact me with any questions or suggestions. For any bugs or anything code-related, please open an issue. Otherwise, you can email me at Harryshomer@gmail.com.