Python code for working with NFL play-by-play data.

Overview

nfl_data_py

nfl_data_py is a Python library for interacting with NFL data sourced from nflfastR, nfldata, dynastyprocess, and Draft Scout.

Includes import functions for play-by-play data, weekly data, seasonal data, rosters, win totals, scoring lines, officials, draft picks, draft pick values, schedules, team descriptive info, combine results and id mappings across various sites.

Installation

Use the package manager pip to install nfl_data_py.

pip install nfl_data_py

Usage

import nfl_data_py as nfl

Working with play-by-play data

nfl.import_pbp_data(years, columns, downcast=True, cache=False, alt_path=None)

Returns play-by-play data for the years and columns specified

years : required, list of years to pull data for (earliest available is 1999)

columns : optional, list of columns to pull data for

downcast : optional, converts float64 columns to float32, reducing memory usage by ~30% but slowing the initial load by ~50%

cache : optional, determines whether to pull pbp data from the GitHub repo (False) or from the local cache generated by nfl.cache_pbp() (True)

alt_path : optional, but required if nfl.cache_pbp() was called with an alternate path instead of the default cache location
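A minimal sketch of a typical call, assuming the signature documented above. `validate_years` and `load_pbp` are hypothetical helpers, not part of nfl_data_py; the library module is passed in as an argument so the helper can be exercised without a network connection.

```python
from datetime import date

EARLIEST_PBP_YEAR = 1999  # earliest season listed in the documentation above


def validate_years(years, latest=None):
    """Return a sorted list of unique years, rejecting seasons outside the available range."""
    latest = date.today().year if latest is None else latest
    years = sorted(set(years))
    bad = [y for y in years if not EARLIEST_PBP_YEAR <= y <= latest]
    if bad:
        raise ValueError(f"years outside {EARLIEST_PBP_YEAR}-{latest}: {bad}")
    return years


def load_pbp(nfl, years, columns=None):
    """Hypothetical wrapper: validate the years, then call nfl.import_pbp_data."""
    # `nfl` is expected to be the imported nfl_data_py module.
    return nfl.import_pbp_data(validate_years(years), columns, downcast=True, cache=False)
```

With the library installed, this could be used as `import nfl_data_py as nfl` followed by `pbp = load_pbp(nfl, [2020, 2021], ['epa', 'week'])`; `nfl.see_pbp_cols()` lists the column names that can be requested.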

nfl.see_pbp_cols()

Returns a list of the columns available in the play-by-play dataset

Working with weekly data

nfl.import_weekly_data(years, columns, downcast)

Returns weekly data for the years and columns specified

years : required, list of years to pull data for (earliest available is 1999)

columns : optional, list of columns to pull data for

downcast : optional, converts float64 columns to float32, reducing memory usage by ~30% but slowing the initial load by ~50%
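To illustrate what the downcast option does, here is a small sketch on synthetic data. It is not the library's own implementation; `downcast_floats` is a hypothetical helper and the column names are illustrative only.

```python
import numpy as np
import pandas as pd


def downcast_floats(df):
    """Return a copy of df with every float64 column converted to float32."""
    cols = df.select_dtypes(include=[np.float64]).columns
    return df.astype({c: np.float32 for c in cols})


# Synthetic stand-in for weekly data; the column names are made up.
weekly = pd.DataFrame({
    "week": np.arange(1, 1001, dtype=np.int64),
    "passing_yards": np.random.default_rng(0).normal(250, 60, 1000),
    "passing_epa": np.random.default_rng(1).normal(0, 5, 1000),
})
small = downcast_floats(weekly)

# Compare the memory footprint before and after the conversion.
before = weekly.memory_usage(deep=True).sum()
after = small.memory_usage(deep=True).sum()
print(f"{before} bytes -> {after} bytes")
```

The actual saving depends on how many columns are floats; integer and string columns are left untouched.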

nfl.see_weekly_cols()

Returns a list of the columns available in the weekly dataset

Working with seasonal data

nfl.import_seasonal_data(years)

Returns seasonal data, including various calculated market share stats

years : required, list of years to pull data for (earliest available is 1999)

Additional data imports

nfl.import_rosters(years, columns)

Returns roster information for years and columns specified

years : required, list of years to pull data for (earliest available is 1999)

columns : optional, list of columns to pull data for

nfl.import_win_totals(years)

Returns win total lines for years specified

years : optional, list of years to pull

nfl.import_sc_lines(years)

Returns scoring lines for years specified

years : optional, list of years to pull

nfl.import_officials(years)

Returns official information by game for the years specified

years : optional, list of years to pull

nfl.import_draft_picks(years)

Returns list of draft picks for the years specified

years : optional, list of years to pull

nfl.import_draft_values()

Returns relative values by generic draft pick according to various popular valuation methods

nfl.import_team_desc()

Returns dataframe with color/logo/etc. information for all NFL teams

nfl.import_schedules(years)

Returns dataframe with schedule information for years specified

years : required, list of years to pull data for (earliest available is 1999)

nfl.import_combine_data(years, positions)

Returns dataframe with combine results for years and positions specified

years : optional, list or range of years to pull data from

positions : optional, list of positions to be pulled (standard format - WR/QB/RB/etc.)

nfl.import_ids(columns, ids)

Returns dataframe with mapped ids for all players across most major NFL and fantasy football data platforms

columns : optional, list of columns to return

ids : optional, list of ids to return
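Because the id table is typically used as a lookup in merges, it can help to check a key column for duplicates first. The sketch below uses a synthetic table; `find_duplicate_ids` is a hypothetical helper, and the names and ids are made up for illustration.

```python
import pandas as pd


def find_duplicate_ids(ids, column):
    """Return the rows whose non-null value in `column` appears more than once."""
    present = ids[ids[column].notna()]
    dupes = present[present.duplicated(subset=column, keep=False)]
    return dupes.sort_values(column, kind="stable")


# Synthetic example table; in practice this would be the frame from nfl.import_ids().
ids = pd.DataFrame({
    "name": ["Player A", "Player A (dup)", "Player B", "Player C"],
    "gsis_id": ["00-0000001", "00-0000001", "00-0000002", None],
})
dupes = find_duplicate_ids(ids, "gsis_id")
print(dupes)
```

If the result is non-empty, a many-to-one merge on that column will not be one-to-one on the lookup side; `pandas.merge(..., validate="m:1")` raises an error in that case rather than silently duplicating rows.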

nfl.import_ngs_data(stat_type, years)

Returns dataframe with specified NGS data

stat_type : required, type of data (passing, rushing, receiving)

years : optional, list of years to return data for

nfl.import_depth_charts(years)

Returns dataframe with depth chart data

years : optional, list of years to return data for

nfl.import_injuries(years)

Returns dataframe of injury reports

years : optional, list of years to return data for

nfl.import_qbr(years, level, frequency)

Returns dataframe with QBR history

years : optional, years to return data for

level : optional, competition level to return data for, nfl or college, default nfl

frequency : optional, frequency to return data for, weekly or season, default season

nfl.import_pfr_passing(years)

Returns dataframe of PFR passing data

years : optional, years to return data for

nfl.import_snap_counts(years)

Returns dataframe with snap count records

years : optional, list of years to return data for

Additional features

nfl.cache_pbp(years, downcast=True, alt_path=None)

Caches play-by-play data locally to speed up subsequent loads. Years that have already been cached are overwritten, so when using the cache in-season, re-cache once per week to pick up the most recent data

years : required, list or range of years to cache

downcast : optional, converts float64 columns to float32, reducing memory usage by ~30% but slowing the initial load by ~50%

alt_path : optional, alternate path in which to store the pbp cache; the default is a program-created folder in the user's Local folder
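The cache-then-load workflow described above might be wrapped as follows. This is a sketch under the documented signatures; `refresh_and_load` is a hypothetical helper, and the module is passed in as `nfl` so the logic can be tested with a stub.

```python
def refresh_and_load(nfl, years, columns=None, alt_path=None):
    """Re-cache the requested seasons, then read them back from the local cache.

    `nfl` is expected to be the imported nfl_data_py module. The same
    alt_path is passed to both calls so the reader looks where the
    writer stored the files.
    """
    years = list(years)
    # Overwrites any previously cached copy of these seasons.
    nfl.cache_pbp(years, downcast=True, alt_path=alt_path)
    # cache=True reads from the local cache instead of downloading again.
    return nfl.import_pbp_data(years, columns, downcast=True, cache=True, alt_path=alt_path)
```

In-season, calling this once per week keeps the cached copy current; between refreshes, `nfl.import_pbp_data(years, columns, cache=True)` alone is enough.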

nfl.clean_nfl_data(df)

Runs descriptive data (team name, player name, etc.) through various cleaning processes

df : required, dataframe to be cleaned

Recognition

I'd like to recognize all of Ben Baldwin, Sebastian Carl, and Lee Sharpe for making this data freely available and easy to access. I'd also like to thank Tan Ho, who has been an invaluable resource as I've worked through this project, and Josh Kazan for the resources and assistance he's provided.

Contributing

Pull requests are welcome. For major changes, please open an issue first to discuss what you would like to change.

License

MIT

Comments
  • Unable to import data

    I am trying to run

    import nfl_data_py as nfl
    nfl.import_seasonal_data([2010])

    but I get the following error:

    OverflowError: value too large
    Exception ignored in: 'fastparquet.cencoding.read_bitpacked'

    I have tried uninstalling and reinstalling fastparquet, which did not work, and the same issue has arisen in my other function calls.

    More specifically:

    line 530, in read_col
        part[defi == max_defi] = dic[val]
    IndexError: index 8389024 is out of bounds for axis 0 with size 94518

    opened by samob917 6
  • Caching


    I think the library would benefit from having a caching strategy for downloaded or processed files. One option is to download the files outside of pandas and use an http cache. This tends to be temporary, no more than 7 days. Another option is to give the user the option to read from and save to a cache, which could just involve reading/writing parquet files from the existing data directory, which would be more permanent, or to the system tmp directory, which is more transient. Let me know if either option is of interest.

    opened by sansbacon 6
  • read_parquet engine parameter: 'fastparquet' vs. default 'auto' argument

    Pandas will try pyarrow and fall back to fastparquet if pyarrow is unavailable. Is there a specific reason you are specifying engine='fastparquet'? If not, I think it would be better to leave it as the default 'auto' argument.

    opened by sansbacon 3
  • ID Tables Small Issue?


    Hi, i'm a newb and this is my first issue post so please delete if this is wrong. I suspect your code is fine and that the source data has some bugs, but i'm not sure who to alert to help them out.

    I believe there are some slight data issues on the player ID table.

    probably some more than listed below, but i have to run for now. these are pretty small things, doubtful they will be relevant to anything imminent for anyone... i ran into it on a merge i was doing on gsis_id that didn't like my many-to-one relationship because of it. i'm happy to share and/or try to help fix if manual edits are an option (i stink at coding though)

    gsis_id has four duplications: 00-0020270, 00-0019641, 00-0016098, and 00-0029435. On further research, each of these same cases also has a duplication of the pff_id. There are no other pff_id duplications.

    espn_ids have some duplication: 17257, 2578554, 2574009, 5774, 5730, 12771, 2516049, 13490, 2574010, 14660, 2582138, 16094, 17101. i'm not sure the source of all of these - some seem to be extremely similar names but different people and others seem to be typos (e.g. 5774 one of them should simply be 15774).

    yahoo_ids have one dup: 33495. the one from Duke should be 33542

    opened by robertryanharrison 2
  • .import_ngs_data() HTTPError when filtering for year(s)


    HTTPError: HTTP Error 404: Not Found

    I receive a 404: not found error when I try to import ngs data from a specific year:

    This works just fine and returns the dataframe:

    ngs_data = nfl.import_ngs_data('passing')
    
    ngs_data
    

    When I add a specific year argument, this breaks:

    ngs_data = nfl.import_ngs_data('passing', [2020])
    
    ngs_data
    

    Yes: I realize that nfl.import_ngs_data('passing') is comprehensive of all seasons that are available and I can filter with that.

    I don't know much about these different file formats, but when I look in nflverse-data/releases/nextgen_stats it looks like there are no .parquet files specific to a season (only .qs, .csv.gz, .rds); the only .parquet files are the larger ones (sans season) called in the first block.

    So I think something just needs to be changed around here?

    Thanks for putting together this awesome library!! It's been a huge help as I get ready for the season.

    opened by jbf302 2
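The workaround mentioned in this report (load every season, then filter locally) can be sketched as below. `filter_seasons` is a hypothetical helper, the frame is synthetic, and the `season` column name is an assumption about the returned data.

```python
import pandas as pd


def filter_seasons(df, years, season_col="season"):
    """Keep only the rows whose season is in `years`; return everything if years is empty or None."""
    if not years:
        return df
    return df[df[season_col].isin(list(years))]


# Synthetic stand-in for the frame returned by nfl.import_ngs_data('passing').
ngs = pd.DataFrame({"season": [2019, 2020, 2020, 2021], "attempts": [30, 41, 28, 35]})
ngs_2020 = filter_seasons(ngs, [2020])
print(ngs_2020)
```

With the library installed, the same helper could be applied as `filter_seasons(nfl.import_ngs_data('passing'), [2020])`.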
  • What version of Python is this for? Getting an error


    I have tried to install it for both Python 3.8 and 3.6 and I am getting this error:

    Collecting pandas>1 (from nfl_data_py)
    Could not find a version that satisfies the requirement pandas>1 (from nfl_data_py) (from versions: 0.1, 0.2, 0.3.0, 0.4.0, 0.4.1, 0.4.2, 0.4.3, 0.5.0, 0.6.0, 0.6.1, 0.7.0, 0.7.1, 0.7.2, 0.7.3, 0.8.0, 0.8.1, 0.9.0, 0.9.1, 0.10.0, 0.10.1, 0.11.0, 0.12.0, 0.13.0, 0.13.1, 0.14.0, 0.14.1, 0.15.0, 0.15.1, 0.15.2, 0.16.0, 0.16.1, 0.16.2, 0.17.0, 0.17.1, 0.18.0, 0.18.1, 0.19.0, 0.19.1, 0.19.2, 0.20.0, 0.20.1, 0.20.2, 0.20.3, 0.21.0, 0.21.1, 0.22.0, 0.23.0, 0.23.1, 0.23.2, 0.23.3, 0.23.4, 0.24.0, 0.24.1, 0.24.2, 0.25.0, 0.25.1, 0.25.2, 0.25.3)

    opened by epmck 2
  • Reduce memory usage of multi-year pbp dataframe


    If you load 20 years of data into one dataframe, you could start pushing up against memory limits on certain users' computers. You can reduce memory usage about 30% by converting float64 values to float32 when loading the yearly play-by-play data.

    cols = df.select_dtypes(include=[np.float64]).columns
    df.loc[:, cols] = df.loc[:, cols].astype(np.float32)
    

    On my computer, this reduced the memory usage of a single year from 129.7 MB to 94.5 MB. I don't think the lost precision is going to matter for anything we are doing with football stats.

    If you are interested, I can submit a pull request that implements this change. You could also make it optional with the default being to downcast but allow the user to override if they want np.float64 as the dtype.

    opened by sansbacon 2
  • HTTP Error 404: Not Found


    There's a chance the links have changed in nflfastR. This link doesn't seem to work: data = pandas.read_parquet(r'https://github.com/nflverse/nflfastR-data/raw/master/data/player_stats.parquet', engine='auto')

    Using Python v3.10.7 with nfl_data_py v0.2.5

    Call: nfl.import_weekly_data([2021, 2022])

    Error:

    HTTPError                                 Traceback (most recent call last)
    ~\AppData\Local\Temp/ipykernel_15024/1768614258.py in <module>
    ----> 1 nfl.import_weekly_data([2021, 2022])

    c:\Users\andre\AppData\Local\Programs\Python\Python310\lib\site-packages\nfl_data_py\__init__.py in import_weekly_data(years, columns, downcast)
        215
        216     # read weekly data
    --> 217     data = pandas.read_parquet(r'https://github.com/nflverse/nflfastR-data/raw/master/data/player_stats.parquet', engine='auto')
        218     data = data[data['season'].isin(years)]
        219

    c:\Users\andre\AppData\Local\Programs\Python\Python310\lib\site-packages\pandas\io\parquet.py in read_parquet(path, engine, columns, storage_options, use_nullable_dtypes, **kwargs)
        493     impl = get_engine(engine)
        494
    --> 495     return impl.read(
        496         path,
        497         columns=columns,

    c:\Users\andre\AppData\Local\Programs\Python\Python310\lib\site-packages\pandas\io\parquet.py in read(self, path, columns, use_nullable_dtypes, storage_options, **kwargs)
        230             to_pandas_kwargs["split_blocks"] = True  # type: ignore[assignment]
        231
    --> 232         path_or_handle, handles, kwargs["filesystem"] = _get_path_or_handle(
        233             path,
        234             kwargs.pop("filesystem", None),
    ...
    --> 643             raise HTTPError(req.full_url, code, msg, hdrs, fp)
        644
        645 class HTTPRedirectHandler(BaseHandler):

    HTTPError: HTTP Error 404: Not Found

    opened by andre03051 1
  • Pandas version incompatibility


    Per discussion in nflverse discord.

    Release 0.2.7 updated Pandas to a version not compatible with Python 3.6. This requirement should be reverted if possible. Alternatively if functionality in the updated Pandas is found to be needed, then the package metadata should be updated to specify it requires Python >= 3.7.

    opened by alecglen 1
  • Pandas .append method deprecation

    nfl.import_pbp_data gives the warning below. The method should be updated to use concat rather than append.

    FutureWarning: The frame.append method is deprecated and will be removed from pandas in a future version. Use pandas.concat instead.
      plays = plays.append(raw)

    opened by martydertz 1
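The change suggested in this report could look roughly like the sketch below: collect the per-season frames in a list and concatenate once. `combine_years` is a hypothetical helper and the frames are synthetic; this is not the library's actual code.

```python
import pandas as pd


def combine_years(frames):
    """Concatenate per-season frames once, instead of calling DataFrame.append in a loop."""
    frames = list(frames)
    if not frames:
        return pd.DataFrame()
    return pd.concat(frames, ignore_index=True)


# Synthetic per-season frames standing in for the yearly play-by-play downloads.
raw_by_year = [
    pd.DataFrame({"season": [2020, 2020], "epa": [0.5, -0.2]}),
    pd.DataFrame({"season": [2021], "epa": [1.1]}),
]
plays = combine_years(raw_by_year)
print(plays)
```

Concatenating once also avoids copying the accumulated frame on every iteration, which is what repeated appends do.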
  • import_pbp_data function not working

    When trying to run the function import_pbp_data, I receive the following error message: No such file or directory: [path]

    Code:

    import nfl_data_py as nfl
    data = nfl.import_pbp_data([2021])

    opened by Josephhero 1
  • Missing EPA data


    In week 6 of 2021, the Chicago Bears lost 24-14 to the Green Bay Packers: https://www.nfl.com/games/packers-at-bears-2021-reg-6. When trying to load the EPA data for this game, I get an empty dataframe. Could someone let me know if I am doing something wrong?

    import nfl_data_py as nfl
    
    seasons = [2021]
    cols = ['epa', 'week', 'possession_team']
    
    nfl.cache_pbp(seasons, downcast=True, alt_path=None)
    data = nfl.import_pbp_data(seasons, cols, cache=True)
    
    print(data.loc[(data['possession_team'] == 'CHI') & (data['week'] == 5)])
    print(data.loc[(data['possession_team'] == 'CHI') & (data['week'] == 6)])
    
    
    opened by pphili 2
  • Changed downcast code to use df[cols] instead of df.loc[:, cols]


    pandas was giving me the following FutureWarning when importing data:

    /nfl_data_py/__init__.py:137: FutureWarning: In a future version, `df.iloc[:, i] = newvals` will attempt
    to set the values inplace instead of always setting a new array. To retain the old behavior, use either
    `df[df.columns[i]] = newvals` or, if columns are non-unique, `df.isetitem(i, newvals)`
        plays.loc[:, cols] = plays.loc[:, cols].astype(numpy.float32)
    

    This change seems to have eliminated the warning.

    opened by wstrausser 0
  • import_win_totals() years not optional

    The GitHub README and PyPI docs say years is optional, but years is required. This is a bit difficult because not all years are in the data, so it's hard to know why 2022 doesn't show.

    Suggest making argument years=None and adding the following in the method/lambda: if years is None: years = []

    Let me know how I can contribute. --David

    opened by DavidAllyn68 0
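The default-argument pattern proposed in this report can be sketched in plain Python. `normalize_years` and `select_rows` are hypothetical stand-ins for an import function, not the library's code, and the rows are made up.

```python
def normalize_years(years=None):
    """Turn an optional `years` argument into a list; None means 'no filter'."""
    if years is None:
        return []
    return list(years)


def select_rows(rows, years=None):
    """Hypothetical stand-in for an import function: filter dict rows by season."""
    wanted = normalize_years(years)
    if not wanted:  # empty list: return every season
        return list(rows)
    return [r for r in rows if r["season"] in wanted]


rows = [{"season": 2020, "team": "CHI"}, {"season": 2021, "team": "GB"}]
print(select_rows(rows))          # no years given: both rows
print(select_rows(rows, [2021]))  # only the 2021 row
```

Using `None` as the default, rather than a mutable `[]` in the signature, avoids sharing one list object across calls.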
  • Data Dictionary


    Hey guys - thanks for pulling this all together!!! Do you have a data dictionary explaining the columns? Most are self explanatory, but a few are cryptic (e.g. in weekly_data dakota, pacr, racr, wopr, and ..._epa).

    opened by DavidAllyn68 2
  • Missing Personnel Data Week 4 NFL


    It appears that the values for formation data (offense/defense), players on the play (offense/defense), men in the box, # of pass rushers, etc. are unavailable for 2022 Week 4 data (which is otherwise available). Is this a known issue, an intentional omission, or something else? Where is this data sourced from?

    Wondering if it will be available in the near future or if that is a result of an availability issue

    opened by jwald3 1
Releases(v0.3.0)