The purpose of this package is to give people the ability to scrape raw National Hockey League (NHL) and National Women's Hockey League (NWHL) data from their respective APIs and websites.

Detailed description of the hockey-scraper Python project


[PyPI version badge: https://badge.fury.io/py/hockey-scraper.svg] [Documentation Status badge]

Hockey Scraper

Purpose

This package is designed to allow people to scrape both NHL and NWHL data. For the NHL, one can scrape play-by-play and shift data from the NHL API and website for all preseason, regular season, and playoff games since the 2007-2008 season. For the NWHL, one can scrape play-by-play data from their API and website for all preseason, regular season, and playoff games since the 2015-2016 season.

Prerequisites

You need to have Python installed for this. It should work for both Python 2.7 and 3 (I recommend at least version 3.6.0, but earlier versions should be fine).

If you don't have Python installed on your machine, I recommend installing it through the Anaconda distribution. Anaconda comes with a bunch of libraries preinstalled, which makes it easier to get started.

Installation

To install, simply open a terminal and type:

pip install hockey_scraper
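
To verify the install worked, a minimal sanity check (nothing here beyond the import itself) is:

import hockey_scraper

# If this import succeeds without errors, the package is installed
print("hockey_scraper installed")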

NHL Usage

Standard Scrape Functions

Scrape data on a season-by-season level:

import hockey_scraper

# Scrapes the 2015 & 2016 season with shifts and stores the data in a Csv file
hockey_scraper.scrape_seasons([2015, 2016], True)

# Scrapes the 2008 season without shifts and returns a dictionary containing the pbp Pandas DataFrame
scraped_data = hockey_scraper.scrape_seasons([2008], False, data_format='Pandas')

Scrape a list of games:

import hockey_scraper

# Scrapes the first game of 2014, 2015, and 2016 seasons with shifts and stores the data in a Csv file
hockey_scraper.scrape_games([2014020001, 2015020001, 2016020001], True)

# Scrapes the first game of 2007, 2008, and 2009 seasons with shifts and returns a Dictionary with the Pandas DataFrames
scraped_data = hockey_scraper.scrape_games([2007020001, 2008020001, 2009020001], True, data_format='Pandas')
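
As a side note, the game IDs above follow the NHL's usual convention: the first four digits are the season's starting year, the next two are the game type (02 for regular season, 03 for playoffs), and the last four are the game number. A small helper for building them (make_game_id is my own illustration, not part of the package):

import hockey_scraper

# Hypothetical helper, not part of hockey_scraper: builds an NHL game ID
# from its parts, assuming the SSSSTTNNNN convention described above
def make_game_id(season, game_type, game_number):
    return int("{}{:02d}{:04d}".format(season, game_type, game_number))

# make_game_id(2016, 2, 1) -> 2016020001, the first regular season game of 2016-17
hockey_scraper.scrape_games([make_game_id(2016, 2, 1)], False)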

Scrape all games in a given date range:

import hockey_scraper

# Scrapes all games between 2016-10-10 and 2016-10-20 without shifts and stores the data in a Csv file
hockey_scraper.scrape_date_range('2016-10-10', '2016-10-20', False)

# Scrapes all games between 2015-1-1 and 2015-1-15 without shifts and returns a Dictionary with the pbp Pandas DataFrame
scraped_data = hockey_scraper.scrape_date_range('2015-1-1', '2015-1-15', False, data_format='Pandas')

The dictionary returned by setting the argument 'data_format' equal to 'Pandas' is structured like:

{
  # Both of these are always included
  'pbp': pbp_df,
  'errors': scraping_errors,

  # This is only included when the argument 'if_scrape_shifts' is set equal to True
  'shifts': shifts_df
}
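
For example, here is one way to consume that returned dictionary (a minimal sketch; the keys are exactly the ones listed above):

import hockey_scraper

# Scrape one season with shifts and get everything back in memory
scraped_data = hockey_scraper.scrape_seasons([2016], True, data_format='Pandas')

pbp_df = scraped_data['pbp']        # Play-by-play DataFrame (always present)
shifts_df = scraped_data['shifts']  # Only present because shifts were scraped
print(scraped_data['errors'])       # Any games that couldn't be scraped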

The scraped files can also be saved in a separate directory if desired. This allows games to be re-scraped much faster, since the pages don't need to be retrieved again. This is done by setting the keyword argument 'docs_dir' to True, which creates, stores in, and looks in a directory in your home directory. Alternatively, you can provide your own directory to store them in (it must already exist).

import hockey_scraper

# Create or try to refer to a directory in the home directory
# Will create a directory called 'hockey_scraper_data' in the home directory (if it doesn't exist)
hockey_scraper.scrape_seasons([2015, 2016], True, docs_dir=True)

# Path to the given directory
USER_PATH = "/...."

# Scrapes the 2015 & 2016 season with shifts and stores the data in a Csv file
# Also includes a path for an existing directory for the scraped files to be placed in or retrieved from.
hockey_scraper.scrape_seasons([2015, 2016], True, docs_dir=USER_PATH)

# One could choose to re-scrape previously saved files by setting the keyword argument rescrape=True
hockey_scraper.scrape_seasons([2015, 2016], True, docs_dir=USER_PATH, rescrape=True)

Live Scraping

Here is a simple example of one way to set up live scraping. I strongly suggest checking out this section of the documentation if you plan on using this.

import hockey_scraper as hs


def to_csv(game):
    """
    Store each game DataFrame in a file

    :param game: LiveGame object

    :return: None
    """

    # If the game:
    # 1. Has started - we recorded at least one event
    # 2. Is not in intermission
    # 3. Is not over
    if game.is_ongoing():
        # Get both DataFrames
        pbp_df = game.get_pbp()
        shifts_df = game.get_shifts()

        # Print the description of the last event
        print(game.game_id, "->", pbp_df.iloc[-1]['Description'])

        # Store in CSV files
        pbp_df.to_csv(f"../hockey_scraper_data/{game.game_id}_pbp.csv", sep=',')
        shifts_df.to_csv(f"../hockey_scraper_data/{game.game_id}_shifts.csv", sep=',')

if __name__ == "__main__":
    # Before we start, set the directory to store the files in
    # You don't have to do this but I recommend it
    hs.live_scrape.set_docs_dir("../hockey_scraper_data")

    # Scrape the info for all the games on 2018-11-15
    games = hs.ScrapeLiveGames("2018-11-15", if_scrape_shifts=True, pause=20)

    # While all the games aren't finished
    while not games.finished():
        # Update for all the games currently being played
        games.update_live_games(sleep_next=True)

        # Go through every LiveGame object and apply some function
        # You can of course do whatever you want here.
        for game in games.live_games:
            to_csv(game)

NWHL Usage

Scrape data on a season-by-season level:

import hockey_scraper

# Scrapes the 2015 & 2016 season and stores the data in a Csv file
hockey_scraper.nwhl.scrape_seasons([2015, 2016])

# Scrapes the 2017 season and returns a Pandas DataFrame containing the pbp
scraped_data = hockey_scraper.nwhl.scrape_seasons([2017], data_format='Pandas')

Scrape a list of games:

import hockey_scraper

# Scrape some games and store the results in a Csv file
# Also saves the scraped pages
hockey_scraper.nwhl.scrape_games([14694271, 14814946, 14689491], docs_dir="...Path you specified")

Scrape all games in a given date range:

import hockey_scraper

# Scrapes all games between 2016-10-10 and 2017-01-01 and returns a Pandas DataFrame containing the pbp
hockey_scraper.nwhl.scrape_date_range('2016-10-10', '2017-01-01', data_format='pandas')

The full documentation can be found here.

Contact

Please contact me with any questions or suggestions. For any bugs or anything related to the code, please open an issue. Otherwise you can email me at Harryshomer@gmail.com.
