Python code for working with NFL play by play data.

Last update: Jan 05, 2023

Overview

nfl_data_py

nfl_data_py is a Python library for interacting with NFL data sourced from nflfastR, nfldata, dynastyprocess, and Draft Scout.

Includes import functions for play-by-play data, weekly data, seasonal data, rosters, win totals, scoring lines, officials, draft picks, draft pick values, schedules, team descriptive info, combine results and id mappings across various sites.

Installation

Use the package manager pip to install nfl_data_py.

pip install nfl_data_py

Usage

import nfl_data_py as nfl

Working with play-by-play data

nfl.import_pbp_data(years, columns, downcast=True, cache=False, alt_path=None)

Returns play-by-play data for the years and columns specified

years : required, list of years to pull data for (earliest available is 1999)

columns : optional, list of columns to pull data for

downcast : optional, converts float64 columns to float32, reducing memory usage by ~30% but slowing the initial load by ~50%

cache : optional, determines whether to pull pbp data from github repo or local cache generated by nfl.cache_pbp()

alt_path : optional, required if nfl.cache_pbp() is called using an alternate path to the default cache

nfl.see_pbp_cols()

Returns list of columns available in play-by-play dataset
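As a rough illustration of the calls above, the sketch below wraps nfl.import_pbp_data in a small helper. The names usable_years and load_pbp are hypothetical and not part of nfl_data_py; the library is imported inside the function so that nothing is downloaded until load_pbp is actually called.

```python
EARLIEST_PBP_SEASON = 1999  # earliest season available, per the parameter notes above


def usable_years(years, earliest=EARLIEST_PBP_SEASON):
    """Return the requested seasons as a sorted, de-duplicated list,
    dropping any season before `earliest`."""
    return sorted({int(y) for y in years if int(y) >= earliest})


def load_pbp(years, columns=None, downcast=True):
    """Hypothetical wrapper: fetch play-by-play data for the usable seasons."""
    # Imported lazily; calling this function needs the package and network access.
    import nfl_data_py as nfl

    return nfl.import_pbp_data(usable_years(years), columns, downcast=downcast)
```

A call such as load_pbp([2021], ['epa', 'week']) would then return only those two columns for the 2021 season, assuming both names appear in nfl.see_pbp_cols().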

Working with weekly data

nfl.import_weekly_data(years, columns, downcast)

Returns weekly data for the years and columns specified

years : required, list of years to pull data for (earliest available is 1999)

columns : optional, list of columns to pull data for

downcast : optional, converts float64 columns to float32, reducing memory usage by ~30% but slowing the initial load by ~50%

nfl.see_weekly_cols()

Returns list of columns available in weekly dataset

Working with seasonal data

nfl.import_seasonal_data(years)

Returns seasonal data, including various calculated market share stats

years : required, list of years to pull data for (earliest available is 1999)

Additional data imports

nfl.import_rosters(years, columns)

Returns roster information for years and columns specified

years : required, list of years to pull data for (earliest available is 1999)

columns : optional, list of columns to pull data for

nfl.import_win_totals(years)

Returns win total lines for years specified

years : optional, list of years to pull

nfl.import_sc_lines(years)

Returns scoring lines for years specified

years : optional, list of years to pull

nfl.import_officials(years)

Returns official information by game for the years specified

years : optional, list of years to pull

nfl.import_draft_picks(years)

Returns list of draft picks for the years specified

years : optional, list of years to pull

nfl.import_draft_values()

Returns relative values by generic draft pick according to various popular valuation methods

nfl.import_team_desc()

Returns dataframe with color/logo/etc. information for all NFL teams

nfl.import_schedules(years)

Returns dataframe with schedule information for years specified

years : required, list of years to pull data for (earliest available is 1999)

nfl.import_combine_data(years, positions)

Returns dataframe with combine results for years and positions specified

years : optional, list or range of years to pull data from

positions : optional, list of positions to be pulled (standard format - WR/QB/RB/etc.)

nfl.import_ids(columns, ids)

Returns dataframe with mapped ids for all players across most major NFL and fantasy football data platforms

columns : optional, list of columns to return

ids : optional, list of ids to return

nfl.import_ngs_data(stat_type, years)

Returns dataframe with specified NGS data

stat_type : required, type of data (passing, rushing, receiving)

years : optional, list of years to return data for

nfl.import_depth_charts(years)

Returns dataframe with depth chart data

years : optional, list of years to return data for

nfl.import_injuries(years)

Returns dataframe of injury reports

years : optional, list of years to return data for

nfl.import_qbr(years, level, frequency)

Returns dataframe with QBR history

years : optional, years to return data for

level : optional, competition level to return data for, nfl or college, default nfl

frequency : optional, frequency to return data for, weekly or season, default season

nfl.import_pfr_passing(years)

Returns dataframe of PFR passing data

years : optional, years to return data for

nfl.import_snap_counts(years)

Returns dataframe with snap count records

years : optional, list of years to return data for

Additional features

nfl.cache_pbp(years, downcast=True, alt_path=None)

Caches play-by-play data locally to speed up later loads. Years that have already been cached are overwritten, so when using this in-season, re-cache once per week to pick up the most recent data.

years : required, list or range of years to cache

downcast : optional, converts float64 columns to float32, reducing memory usage by ~30% but slowing the initial load by ~50%

alt_path : optional, alternate path to store pbp cache - default is in program created user Local folder

nfl.clean_nfl_data(df)

Runs descriptive data (team name, player name, etc.) through various cleaning processes

df : required, dataframe to be cleaned

Recognition

I'd like to recognize all of Ben Baldwin, Sebastian Carl, and Lee Sharpe for making this data freely available and easy to access. I'd also like to thank Tan Ho, who has been an invaluable resource as I've worked through this project, and Josh Kazan for the resources and assistance he's provided.

Contributing

Pull requests are welcome. For major changes, please open an issue first to discuss what you would like to change.

License

MIT

Comments

Unable to import data

I am trying to run

import nfl_data_py as nfl
nfl.import_seasonal_data([2010])

but I get the following error:

OverflowError: value too large
Exception ignored in: 'fastparquet.cencoding.read_bitpacked'

I have tried uninstalling and reinstalling fastparquet, which did not work, and the same issue has arisen in my other function calls.

More specifically:

line 530, in read_col
    part[defi == max_defi] = dic[val]

IndexError: index 8389024 is out of bounds for axis 0 with size 94518

opened by samob917 6
Caching

I think the library would benefit from having a caching strategy for downloaded or processed files. One option is to download the files outside of pandas and use an http cache. This tends to be temporary, no more than 7 days. Another option is to give the user the option to read from and save to a cache, which could just involve reading/writing parquet files from the existing data directory, which would be more permanent, or to the system tmp directory, which is more transient. Let me know if either option is of interest.
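A minimal sketch of the second option (an explicit read/save cache), using only the standard library. The function name cached_bytes, the file name, and the seven-day default are assumptions made for illustration; the download step is replaced by a stand-in so the example runs offline.

```python
import tempfile
import time
from pathlib import Path


def cached_bytes(name, fetch, cache_dir, max_age_days=7):
    """Return the bytes stored under `name`, calling `fetch()` only when the
    cached copy is missing or older than `max_age_days`."""
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    path = cache_dir / name

    if path.exists():
        age_seconds = time.time() - path.stat().st_mtime
        if age_seconds <= max_age_days * 86400:
            return path.read_bytes()  # fresh enough: skip the download

    data = fetch()  # in real use, e.g. an HTTP download of a parquet file
    path.write_bytes(data)
    return data


# Demonstration with a stand-in for the download step.
calls = []


def fake_download():
    calls.append(1)
    return b"parquet-bytes"


with tempfile.TemporaryDirectory() as tmp:
    first = cached_bytes("pbp_2021.parquet", fake_download, tmp)
    second = cached_bytes("pbp_2021.parquet", fake_download, tmp)

download_count = len(calls)  # the second call is served from the cache
```

Pointing cache_dir at the package's data directory would give the more permanent variant, and pointing it at the system tmp directory the more transient one.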

opened by sansbacon 6
read_parquet engine parameter: 'fastparquet' vs. default 'auto' argument

Pandas will try pyarrow and default back to fastparquet if pyarrow is unavailable. Is there a specific reason you are specifying engine='fastparquet' because, if not, I think it would be better to leave it as the default 'auto' argument.

opened by sansbacon 3
ID Tables Small Issue?
Hi, i'm a newb and this is my first issue post so please delete if this is wrong. I suspect your code is fine and that the source data has some bugs, but i'm not sure who to alert to help them out.

I believe there are some slight data issues on the player ID table.

probably some more than listed below, but i have to run for now. these are pretty small things, doubtful they will be relevant to anything imminent for anyone... i ran into it on a merge i was doing on gsis_id that didn't like my many-to-one relationship because of it. i'm happy to share and/or try to help fix if manual edits are an option (i stink at coding though)

gsis_id has four duplications: 00-0020270, 00-0019641, 00-0016098, and 00-0029435. On further research, each of these same cases also has a duplication of the pff_id. There are no other pff_id duplications.

espn_ids have some duplication: 17257, 2578554, 2574009, 5774, 5730, 12771, 2516049, 13490, 2574010, 14660, 2582138, 16094, 17101. i'm not sure the source of all of these - some seem to be extremely similar names but different people and others seem to be typos (e.g. 5774 one of them should simply be 15774).

yahoo_ids have one dup: 33495. the one from Duke should be 33542
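For anyone wanting to reproduce this kind of check, here is a small sketch that lists ids appearing on more than one row. The rows are made-up placeholders rather than real table contents, and duplicated_ids is a hypothetical helper, not something the library provides.

```python
from collections import Counter


def duplicated_ids(rows, id_column):
    """Return the ids that occur on more than one row, ignoring missing values."""
    counts = Counter(
        row[id_column] for row in rows if row.get(id_column) is not None
    )
    return sorted(value for value, n in counts.items() if n > 1)


# Made-up rows; names and ids are placeholders.
rows = [
    {"name": "Player A", "gsis_id": "00-0000001", "espn_id": 101},
    {"name": "Player B", "gsis_id": "00-0000002", "espn_id": 102},
    {"name": "Player B (dup)", "gsis_id": "00-0000002", "espn_id": 103},
    {"name": "Player C", "gsis_id": None, "espn_id": 103},
]

gsis_dups = duplicated_ids(rows, "gsis_id")
espn_dups = duplicated_ids(rows, "espn_id")
```

Any id returned here would break a many-to-one merge on that column.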
opened by robertryanharrison 2
.import_ngs_data() HTTPError when filtering for year(s)
HTTPError: HTTP Error 404: Not Found

I receive a 404: not found error when I try to import ngs data from a specific year:

This works just fine and returns the dataframe:

ngs_data = nfl.import_ngs_data('passing')
ngs_data

When I add a specific year argument, this breaks:

ngs_data = nfl.import_ngs_data('passing', [2020])
ngs_data

Yes: I realize that nfl.import_ngs_data('passing') is comprehensive of all seasons that are available and I can filter with that.

I don't know much about these different file formats, but when I look in nflverse-data/releases/nextgen_stats it looks like there are no .parquet files specific to a season (only .qs, .csv.gz, .rds), only the larger ones (sans season) called in the first block:

So I think something just needs to be changed around here?

Thanks for putting together this awesome library!! It's been a huge help as I get ready for the season.
opened by jbf302 2
What version of Python is this for? Getting an error

I have tried to install it for both Python 3.8 and 3.6 and I am getting this error:

Collecting pandas>1 (from nfl_data_py) Could not find a version that satisfies the requirement pandas>1 (from nfl_data_py) (from versions: 0.1, 0.2, 0.3.0, 0.4.0, 0.4.1, 0.4.2, 0.4.3, 0.5.0, 0.6.0, 0.6.1, 0.7.0, 0.7.1, 0.7.2, 0.7.3, 0.8.0, 0.8.1, 0.9.0, 0.9.1, 0.10.0, 0.10.1, 0.11.0, 0.12.0, 0.13.0, 0.13.1, 0.14.0, 0.14.1, 0.15.0, 0.15.1, 0.15.2, 0.16.0, 0.16.1, 0.16.2, 0.17.0, 0.17.1, 0.18.0, 0.18.1, 0.19.0, 0.19.1, 0.19.2, 0.20.0, 0.20.1, 0.20.2, 0.20.3, 0.21.0, 0.21.1, 0.22.0, 0.23.0, 0.23.1, 0.23.2, 0.23.3, 0.23.4, 0.24.0, 0.24.1, 0.24.2, 0.25.0, 0.25.1, 0.25.2, 0.25.3)

opened by epmck 2
Reduce memory usage of multi-year pbp dataframe
If you load 20 years of data into one dataframe, you could start pushing up against memory limits on certain users' computers. You can reduce memory usage about 30% by converting float64 values to float32 when loading the yearly play-by-play data.

cols = df.select_dtypes(include=[np.float64]).columns
df.loc[:, cols] = df.loc[:, cols].astype(np.float32)

On my computer, this reduced the memory usage of a single year from 129.7 MB to 94.5 MB. I don't think the lost precision is going to matter for anything we are doing with football stats.

If you are interested, I can submit a pull request that implements this change. You could also make it optional with the default being to downcast but allow the user to override if they want np.float64 as the dtype.
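To make the idea concrete, here is a self-contained sketch on synthetic data. The frame below is a stand-in for a season of play-by-play data, and downcast_floats is a hypothetical helper; it uses DataFrame.astype with a per-column mapping rather than the .loc assignment above, which is a swapped-in way of getting the same result.

```python
import numpy as np
import pandas as pd


def downcast_floats(df):
    """Return a copy of `df` with every float64 column converted to float32."""
    cols = df.select_dtypes(include=[np.float64]).columns
    return df.astype({col: np.float32 for col in cols})


# Synthetic stand-in for play-by-play data (values are random, not real stats).
rng = np.random.default_rng(0)
plays = pd.DataFrame({
    "epa": rng.normal(size=1000),
    "wpa": rng.normal(size=1000),
    "week": rng.integers(1, 18, size=1000),
})

small = downcast_floats(plays)
before = plays.memory_usage(deep=True).sum()
after = small.memory_usage(deep=True).sum()
```

Only the float columns shrink, so the overall saving depends on how much of the frame is float64; integer and string columns are left untouched.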
opened by sansbacon 2
HTTP Error 404: Not Found

There's a chance the links have changed in nflfastR. This link doesn't seem to work: data = pandas.read_parquet(r'https://github.com/nflverse/nflfastR-data/raw/master/data/player_stats.parquet', engine='auto')

Using Python v3.10.7 with nfl_data_py v0.2.5
Call: nfl.import_weekly_data([2021, 2022])
Error:

HTTPError Traceback (most recent call last)
~\AppData\Local\Temp/ipykernel_15024/1768614258.py in
----> 1 nfl.import_weekly_data([2021, 2022])

c:\Users\andre\AppData\Local\Programs\Python\Python310\lib\site-packages\nfl_data_py\__init__.py in import_weekly_data(years, columns, downcast)
    215
    216 # read weekly data
--> 217 data = pandas.read_parquet(r'https://github.com/nflverse/nflfastR-data/raw/master/data/player_stats.parquet', engine='auto')
    218 data = data[data['season'].isin(years)]
    219

c:\Users\andre\AppData\Local\Programs\Python\Python310\lib\site-packages\pandas\io\parquet.py in read_parquet(path, engine, columns, storage_options, use_nullable_dtypes, **kwargs)
    493 impl = get_engine(engine)
    494
--> 495 return impl.read(
    496 path,
    497 columns=columns,

c:\Users\andre\AppData\Local\Programs\Python\Python310\lib\site-packages\pandas\io\parquet.py in read(self, path, columns, use_nullable_dtypes, storage_options, **kwargs)
    230 to_pandas_kwargs["split_blocks"] = True # type: ignore[assignment]
    231
--> 232 path_or_handle, handles, kwargs["filesystem"] = _get_path_or_handle(
    233 path,
    234 kwargs.pop("filesystem", None),
...
--> 643 raise HTTPError(req.full_url, code, msg, hdrs, fp)
    644
    645 class HTTPRedirectHandler(BaseHandler):

HTTPError: HTTP Error 404: Not Found

opened by andre03051 1
Pandas version incompatibility

Per discussion in nflverse discord.

Release 0.2.7 updated Pandas to a version not compatible with Python 3.6. This requirement should be reverted if possible. Alternatively if functionality in the updated Pandas is found to be needed, then the package metadata should be updated to specify it requires Python >= 3.7.

opened by alecglen 1
Pandas .append method deprecation

nfl.import_pbp_data gives the warning below. The method should be updated to use pandas.concat rather than append.

FutureWarning: The frame.append method is deprecated and will be removed from pandas in a future version. Use pandas.concat instead.
    plays = plays.append(raw)
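A sketch of the concat-based pattern, assuming the per-season frames are collected in a list first and stacked once at the end. The frames here are tiny synthetic examples and combine_seasons is a hypothetical name, not the library's code.

```python
import pandas as pd


def combine_seasons(frames):
    """Stack per-season frames into one DataFrame with a fresh 0..n-1 index."""
    return pd.concat(list(frames), ignore_index=True)


# Tiny synthetic frames standing in for two seasons of play-by-play data.
season_2020 = pd.DataFrame({"season": [2020, 2020], "epa": [0.1, -0.3]})
season_2021 = pd.DataFrame({"season": [2021], "epa": [0.5]})

plays = combine_seasons([season_2020, season_2021])
```

Collecting frames in a list and concatenating once also avoids re-copying the accumulated data on every loop iteration, which is what repeated append calls do.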

opened by martydertz 1
import_pbp_data function not working

when trying to run the function import_pbp_data, I receive the following error message: No such file or directory: [path]

Code:

import nfl_data_py as nfl
data = nfl.import_pbp_data([2021])

opened by Josephhero 1

Missing EPA data

In week 6 of 2021, the Chicago Bears lost 24-14 to the Green Bay Packers: https://www.nfl.com/games/packers-at-bears-2021-reg-6. When trying to load the EPA data for this game, I get an empty dataframe. Could someone let me know if I am doing something wrong?

import nfl_data_py as nfl

seasons = [2021]
cols = ['epa', 'week', 'possession_team']

nfl.cache_pbp(seasons, downcast=True, alt_path=None)
data = nfl.import_pbp_data(seasons, cols, cache=True)

print(data.loc[(data['possession_team'] == 'CHI') & (data['week'] == 5)])
print(data.loc[(data['possession_team'] == 'CHI') & (data['week'] == 6)])

opened by pphili 2

Changed downcast code to use df[cols] instead of df.loc[:, cols]

pandas was giving me the following FutureWarning when importing data:

/nfl_data_py/__init__.py:137: FutureWarning: In a future version, `df.iloc[:, i] = newvals` will attempt
to set the values inplace instead of always setting a new array. To retain the old behavior, use either
`df[df.columns[i]] = newvals` or, if columns are non-unique, `df.isetitem(i, newvals)`
    plays.loc[:, cols] = plays.loc[:, cols].astype(numpy.float32)

This change seems to have eliminated the warning.

opened by wstrausser 0

import_win_totals() years not optional

The GitHub README and PyPI docs say years is optional, but years is required. This is a bit difficult because not all years are in the data, so it's hard to know why 2022 doesn't show.

Suggest making argument years=None and adding the following in the method/lambda: if years is None: years = []
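A sketch of that default-argument pattern on plain Python records. The function select_years and the "season" key are assumptions for illustration and do not reflect how import_win_totals is actually written.

```python
def select_years(records, years=None):
    """Return records for the given seasons; with years=None (or an empty
    list), return every record."""
    if years is None:
        years = []
    wanted = set(years)
    if not wanted:
        return list(records)
    return [r for r in records if r["season"] in wanted]


# Made-up records; the values are placeholders.
records = [
    {"season": 2020, "team": "AAA"},
    {"season": 2021, "team": "BBB"},
    {"season": 2021, "team": "CCC"},
]

everything = select_years(records)
only_2021 = select_years(records, [2021])
missing = select_years(records, [2022])
```

A request for a season that is not in the data returns an empty list, which makes it easier to tell "no data for 2022" apart from "argument missing".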

Let me know how I can contribute. --David

opened by DavidAllyn68 0
Data Dictionary

Hey guys - thanks for pulling this all together!!! Do you have a data dictionary explaining the columns? Most are self explanatory, but a few are cryptic (e.g. in weekly_data dakota, pacr, racr, wopr, and ..._epa).

opened by DavidAllyn68 2
Missing Personnel Data Week 4 NFL

It appears that the values for formation data (offense/defense), players on the play (offense/defense), men in the box, # of pass rushers, etc. are unavailable for 2022 Week 4 data (which is otherwise available). Is this a known issue, an intentional omission, or something else? Where is this data sourced from?

Wondering if it will be available in the near future or if that is a result of an availability issue

opened by jwald3 1

Releases(v0.3.0)

v0.3.0(Aug 20, 2022)

Added import functionality for participation, contract, officials, and player data made previously available through nflReadR
v0.2.11(Aug 20, 2022)
Actually fixed issue between python and pandas not resolved in 0.2.9

Dropped python 3.5 support from nfl_data_py to allow for parquet file usage

Fixed position filtering for combine data

v0.2.8(Jul 30, 2022)

Fixed deprecation warning for import_pbp, adjust import_ngs to handle years correctly.
v0.2.7(Jun 4, 2022)

- Now getting data from updated data sources
- Fixed bug that was impeding the import_weekly_data() function
v0.2.6(Mar 15, 2022)

- Cache functionality should work on all systems
- PFR passing and snaps data redirected to new data location
v0.2.4(Aug 28, 2021)

Added functionality for caching pbp data locally to speed up load process by 4-5x.
v0.2.0(Aug 11, 2021)
New release includes functions for pulling:

NGS data

snap counts

depth charts

injury reports

PFR passing stats

And cleaned up repo.
0.1.6(Aug 7, 2021)

- Data imports can now use either pyarrow or fastparquet
- import_schedules() directs to new file with more data
- clean_df() includes a feature for replacing 'NA' with np.nan
v0.1.5(Aug 2, 2021)

v0.1.4(Aug 2, 2021)

Added default downcasting of float64s to float32 to reduce memory usage. Will slow down initial data load. Can be turned off by setting downcast=False in import_weekly_data() or import_pbp_data().
v0.1.3(Jul 31, 2021)

Added pulls for combine data and mapping table with ids for a variety of NFL/fantasy sites
v0.1.2(Jul 29, 2021)

v0.1.0(Jul 29, 2021)

Added functions for pulling betting lines, officials data, and draft pick data.
v0.0.5(Jul 27, 2021)

Added error checking + functions to pull schedule and team descriptive data.
v0.0.4(Jul 26, 2021)

First public version of nfl_data_py
