Scalable machine learning based time series forecasting

Overview

mlforecast

Scalable machine learning based time series forecasting.

CI Lint Python PyPi conda-forge codecov License

Install

PyPI

pip install mlforecast

Optional dependencies

If you want more functionality you can instead use pip install mlforecast[extra1,extra2,...]. The current extra dependencies are:

  • aws: adds the functionality to use S3 as the storage in the CLI.
  • cli: includes the validations necessary to use the CLI.
  • distributed: installs dask to perform distributed training. Note that you'll also need to install either LightGBM or XGBoost.

For example, if you want to perform distributed training through the CLI using S3 as your storage you'll need all three extras, which you can get using: pip install mlforecast[aws,cli,distributed].

conda-forge

conda install -c conda-forge mlforecast

Note that this installation comes with the required dependencies for the local interface. If you want to:

  • Use s3 as storage: conda install -c conda-forge s3path
  • Perform distributed training: conda install -c conda-forge dask and either LightGBM or XGBoost.

How to use

The following provides a very basic overview, for a more detailed description see the documentation.

Programmatic API

Store your time series in a pandas dataframe with an index named unique_id that identifies each time serie, a column ds that contains the datestamps and a column y with the values.

from mlforecast.utils import generate_daily_series

series = generate_daily_series(20)
display_df(series.head())
unique_id ds y
id_00 2000-01-01 00:00:00 0.264447
id_00 2000-01-02 00:00:00 1.28402
id_00 2000-01-03 00:00:00 2.4628
id_00 2000-01-04 00:00:00 3.03552
id_00 2000-01-05 00:00:00 4.04356

Then create a TimeSeries object with the features that you want to use. These include lags, transformations on the lags and date features. The lag transformations are defined as numba jitted functions that transform an array, if they have additional arguments you supply a tuple (transform_func, arg1, arg2, ...).

from mlforecast.core import TimeSeries
from window_ops.expanding import expanding_mean
from window_ops.rolling import rolling_mean

ts = TimeSeries(
    lags=[7, 14],
    lag_transforms={
        1: [expanding_mean],
        7: [(rolling_mean, 7), (rolling_mean, 14)]
    },
    date_features=['dayofweek', 'month']
)
ts
TimeSeries(freq=<Day>, transforms=['lag-7', 'lag-14', 'expanding_mean_lag-1', 'rolling_mean_lag-7_window_size-7', 'rolling_mean_lag-7_window_size-14'], date_features=['dayofweek', 'month'], num_threads=8)

Next define a model. If you want to use the local interface this can be any regressor that follows the scikit-learn API. For distributed training there are LGBMForecast and XGBForecast.

from sklearn.ensemble import RandomForestRegressor

model = RandomForestRegressor(random_state=0)

Now instantiate your forecast object with the model and the time series. There are two types of forecasters, Forecast which is local and DistributedForecast which performs the whole process in a distributed way.

from mlforecast.forecast import Forecast

fcst = Forecast(model, ts)

To compute the features and train the model using them call .fit on your Forecast object.

fcst.fit(series)
Forecast(model=RandomForestRegressor(random_state=0), ts=TimeSeries(freq=<Day>, transforms=['lag-7', 'lag-14', 'expanding_mean_lag-1', 'rolling_mean_lag-7_window_size-7', 'rolling_mean_lag-7_window_size-14'], date_features=['dayofweek', 'month'], num_threads=8))

To get the forecasts for the next 14 days call .predict(14) on the forecaster. This will update the target with each prediction and recompute the features to get the next one.

predictions = fcst.predict(14)

display_df(predictions.head())
unique_id ds y_pred
id_00 2000-08-10 00:00:00 5.24484
id_00 2000-08-11 00:00:00 6.25861
id_00 2000-08-12 00:00:00 0.225484
id_00 2000-08-13 00:00:00 1.22896
id_00 2000-08-14 00:00:00 2.30246

CLI

If you're looking for computing quick baselines, want to avoid some boilerplate or just like using CLIs better then you can use the mlforecast binary with a configuration file like the following:

!cat sample_configs/local.yaml
data:
  prefix: data
  input: train
  output: outputs
  format: parquet
features:
  freq: D
  lags: [7, 14]
  lag_transforms:
    1: 
    - expanding_mean
    7: 
    - rolling_mean:
        window_size: 7
    - rolling_mean:
        window_size: 14
  date_features: ["dayofweek", "month", "year"]
  num_threads: 2
backtest:
  n_windows: 2
  window_size: 7
forecast:
  horizon: 7
local:
  model:
    name: sklearn.ensemble.RandomForestRegressor
    params:
      n_estimators: 10
      max_depth: 7

The configuration is validated using FlowConfig.

This configuration will use the data in data.prefix/data.input to train and write the results to data.prefix/data.output both with data.format.

data_path = Path('data')
data_path.mkdir()
series.to_parquet(data_path/'train')
!mlforecast sample_configs/local.yaml
Split 1 MSE: 0.0251
Split 2 MSE: 0.0180
list((data_path/'outputs').iterdir())
[PosixPath('data/outputs/valid_1.parquet'),
 PosixPath('data/outputs/valid_0.parquet'),
 PosixPath('data/outputs/forecast.parquet')]
Comments
  • mlforecast for multivariate time series analysis

    mlforecast for multivariate time series analysis

    Hello,

    I want to use "mlforecast" library for my Multivariate Time Series problem and I want to know how could I add new features, like holidays or temperature, to the dataset besides 'lags' and 'date_features'. Below is flow configuration:

    `fcst = Forecast(
        models=model,
        freq='W-MON',
        lags=[1,2,3,4,5,6,7,8],
        date_features=['month', 'week']
    )
    `
    

    Is there a way to add exogenous variables to the training process? I could not find relevant information to be able to do this.

    Thank you!

    opened by MariaBocsa 5
  • What's the purpose of using scale_factor?

    What's the purpose of using scale_factor?

    I noticed in the docs under the "Custom predictions" section it references using a scale_factor - I'm just wondering what the purpose of this would be?

    Is it the same purpose as the alpha here?: https://www.kaggle.com/code/lemuz90/m5-mlforecast/notebook

    I'm assuming that it's some kind of post prediction adjustment to improve accuracy but I'm keen to hear the thought process behind it.

    opened by TPreece101 5
  • [FEAT] Add step size argument to cross validation method

    [FEAT] Add step size argument to cross validation method

    Description

    This PR adds the step_size argument to the cross validation method. The argument controls the size between each cross validation window.

    Checklist:

    • [x] This PR has a meaningful title and a clear description.
    • [x] The tests pass.
    • [x] All linting tasks pass.
    • [x] The notebooks are clean.
    feature 
    opened by FedericoGarza 4
  • [FIX] delete cla.yml

    [FIX] delete cla.yml

    Description

    CLA agreement will now be handled by https://cla-assistant.io/ Checklist:

    • [ ] This PR has a meaningful title and a clear description.
    • [ ] The tests pass.
    • [ ] All linting tasks pass.
    • [ ] The notebooks are clean.
    opened by FedericoGarza 3
  • add MLForecast.from_cv

    add MLForecast.from_cv

    Description

    Removes the fit_on_all argument from LightGBMCV and introduces a constructor MLForecast.from_cv that builds the forecast object from a trained cv with the best iteration, features and parameters from cv object. Also makes some small changes to keep the structure of the input dataframe, which are:

    • If the id is not the index the predict method from all forecasts returns it as a column (previously it was always the index)
    • The cv_preds_ argument of LightGBMCV had id and time as a multiindex, now they have the same structure as the input df.

    Checklist:

    • [x] This PR has a meaningful title and a clear description.
    • [x] The tests pass.
    • [x] All linting tasks pass.
    • [x] The notebooks are clean.
    breaking 
    opened by jmoralez 2
  • remove dashes from feature names

    remove dashes from feature names

    Description

    Removes dashes from feature names, e.g. lag-7 becomes lag7.

    Checklist:

    • [ ] This PR has a meaningful title and a clear description.
    • [ ] The tests pass.
    • [ ] All linting tasks pass.
    • [ ] The notebooks are clean.
    breaking 
    opened by jmoralez 2
  • Unable to import Forecast from mlforecast

    Unable to import Forecast from mlforecast

    Description

    Unable to import Forecast from mlforecast

    Reproducible example

    # code goes here
    from mlforecast import Forecast
    
    
    ImportError: cannot import name 'Forecast' from 'mlforecast' (/home//mambaforge/envs/dev/lib/python3.7/site-packages/mlforecast/__init__.py)
    # Stacktrace
    

    Environment info

    python=3.7 pip installlation mlforecast

    Package version: mlforecast=0.2.0

    Additional information

    opened by iki77 2
  • nb Forecast doens't run in latest pypi version

    nb Forecast doens't run in latest pypi version

    This nb doesn't work with latest pypi mlforecast version (installing via pip install mlforecast, version 0.2.0) https://github.com/Nixtla/mlforecast/blob/6ac01ec16e1da2d04ca8ea9e4d4a2ed173f7c534/nbs/forecast.ipynb

    To make it work, I had to specifically pass the same package as in github: pip install git+https://github.com/Nixtla/mlforecast.git#egg=mlforecast

    opened by Gabrielcidral1 2
  • sort only ds and y columns on fit

    sort only ds and y columns on fit

    Description

    Since the input for the transformations has to be sorted we used to sort the whole dataframe, however this can be very inefficient when there are many dynamic columns. This PR sorts using only the ds and y columns before constructing the GroupedArray thus keeping the peak memory usage constant with respect to the number of dynamic features.

    Checklist:

    • [x] This PR has a meaningful title and a clear description.
    • [x] The tests pass.
    • [ ] There isn't a decrease in the tests coverage.
    • [x] All linting tasks pass.
    • [x] The notebooks are clean.
    • [x] If this modifies the docs, you've made sure that they were updated correctly.
    opened by jmoralez 2
  • Bug:  When using Forecast.backtest on a series with freq='W', y_pred contains null values

    Bug: When using Forecast.backtest on a series with freq='W', y_pred contains null values

    Code to reproduce:

    import pandas as pd
    import numpy as np
    from sklearn.linear_model import LinearRegression
    from mlforecast.core import TimeSeries
    from mlforecast.forecast import Forecast
    
    #Generate weekly data
    #https://towardsdatascience.com/forecasting-with-machine-learning-models-95a6b6579090
    
    rng = np.random.RandomState(90)
    serie_length = 52 * 4  #4 years' weekly data
    dates = pd.date_range('2000-01-01', freq='W', periods=serie_length, name='ds')
    y = dates.dayofweek + rng.randint(-1, 2, size=dates.size)
    data = pd.DataFrame({'y': y.astype(np.float64)}, index=dates)
    #data.plot(marker='.', figsize=(20, 6));
    
    train_mlfcst = data.reset_index()[['ds', 'y']]
    train_mlfcst.index = pd.Index(np.repeat(0, data.shape[0]), name='unique_id')
    
    backtest_fcst = Forecast(
        LinearRegression(fit_intercept=False), TimeSeries(lags=[4, 8])
    )
    backtest_results = backtest_fcst.backtest(train_mlfcst, n_windows=2, window_size=52)
    
    result1 = next(backtest_results)
    result1
    
    	ds	y	y_pred
    unique_id			
    0	2001-12-30	6.0	5.105716
    0	2002-01-06	5.0	5.026820
    0	2002-01-13	7.0	4.640784
    0	2002-01-20	5.0	6.145316
    0	2002-01-27	6.0	4.746834
    0	2002-02-03	6.0	4.635672
    0	2002-02-10	7.0	4.271653
    0	2002-02-17	7.0	NaN
    0	2002-02-24	7.0	NaN
    0	2002-03-03	5.0	NaN
    0	2002-03-10	5.0	NaN
    0	2002-03-17	7.0	NaN
    0	2002-03-24	7.0	NaN
    0	2002-03-31	5.0	NaN
    0	2002-04-07	7.0	NaN
    0	2002-04-14	5.0	NaN
    0	2002-04-21	6.0	NaN
    0	2002-04-28	5.0	NaN
    0	2002-05-05	7.0	NaN
    0	2002-05-12	7.0	NaN
    0	2002-05-19	5.0	NaN
    0	2002-05-26	6.0	NaN
    0	2002-06-02	5.0	NaN
    0	2002-06-09	6.0	NaN
    0	2002-06-16	5.0	NaN
    0	2002-06-23	6.0	NaN
    0	2002-06-30	6.0	NaN
    0	2002-07-07	6.0	NaN
    0	2002-07-14	7.0	NaN
    0	2002-07-21	5.0	NaN
    0	2002-07-28	6.0	NaN
    0	2002-08-04	6.0	NaN
    0	2002-08-11	5.0	NaN
    0	2002-08-18	7.0	NaN
    0	2002-08-25	7.0	NaN
    0	2002-09-01	6.0	NaN
    0	2002-09-08	5.0	NaN
    0	2002-09-15	6.0	NaN
    0	2002-09-22	5.0	NaN
    0	2002-09-29	5.0	NaN
    0	2002-10-06	6.0	NaN
    0	2002-10-13	5.0	NaN
    0	2002-10-20	6.0	NaN
    0	2002-10-27	5.0	NaN
    0	2002-11-03	6.0	NaN
    0	2002-11-10	5.0	NaN
    0	2002-11-17	7.0	NaN
    0	2002-11-24	7.0	NaN
    0	2002-12-01	6.0	NaN
    0	2002-12-08	5.0	NaN
    0	2002-12-15	5.0	NaN
    0	2002-12-22	6.0	NaN
    
    opened by AMKiller 2
  • Fix parquet writes for distributed in cli

    Fix parquet writes for distributed in cli

    Description

    A recent change in dask created an error when trying to write a dask dataframe built from futures to parquet. This solves that issue.

    Checklist:

    • [x] This PR has a meaningful title and a clear description.
    • [x] The tests pass.
    • [x] There isn't a decrease in the tests coverage.
    • [x] All linting tasks pass.
    • [x] The notebooks are clean.
    • [x] If this modifies the docs, you've made sure that they were updated correctly.
    opened by jmoralez 2
  • Support one model per horizon approach

    Support one model per horizon approach

    Description

    We currently support only the recursive strategy where the same model is used to predict over the complete horizon and the model's predictions are used to update the target and recompute the features.

    This adds a max_horizon argument to MLForecast.fit to indicate that it should train that many models and use each to predict its corresponding horizon when calling MLForecast.predict.

    Checklist:

    • [x] This PR has a meaningful title and a clear description.
    • [x] The tests pass.
    • [x] All linting tasks pass.
    • [x] The notebooks are clean.
    feature 
    opened by jmoralez 1
Releases(v0.4.0)
  • v0.4.0(Nov 25, 2022)

    What's Changed

    • rename Forecast to MLForecast by @jmoralez in https://github.com/Nixtla/mlforecast/pull/63

    Full Changelog: https://github.com/Nixtla/mlforecast/compare/v0.3.1...v0.4.0

    Source code(tar.gz)
    Source code(zip)
  • v0.3.1(Nov 9, 2022)

    What's Changed

    • fix unused arguments by @jmoralez in https://github.com/Nixtla/mlforecast/pull/61

    Full Changelog: https://github.com/Nixtla/mlforecast/compare/v0.3.0...v0.3.1

    Source code(tar.gz)
    Source code(zip)
  • v0.3.0(Nov 1, 2022)

    What's Changed

    • raise error when serie is too short for backtest by @jmoralez in https://github.com/Nixtla/mlforecast/pull/32
    • allow models list by @jmoralez (#34, #36)
    • [FEAT] Allow used by GitHub section hardcoding lib name by @FedericoGarza in https://github.com/Nixtla/mlforecast/pull/37
    • [FIX] Add black as a development dependency by @FedericoGarza in https://github.com/Nixtla/mlforecast/pull/38
    • rename backtest to cross_validation and return single dataframe by @jmoralez in https://github.com/Nixtla/mlforecast/pull/41
    • Remove TimeSeries from Forecast constructor by @jmoralez in https://github.com/Nixtla/mlforecast/pull/44
    • allow passing column names as arguments. allow ds to be int by @jmoralez in https://github.com/Nixtla/mlforecast/pull/45
    • add LightGBMCV by @jmoralez in https://github.com/Nixtla/mlforecast/pull/48
    • support applying differences to series by @jmoralez in https://github.com/Nixtla/mlforecast/pull/52
    • allow functions as date features by @jmoralez in https://github.com/Nixtla/mlforecast/pull/57
    • Improve docs by @jmoralez in https://github.com/Nixtla/mlforecast/pull/59

    New Contributors

    • @FedericoGarza made their first contribution in https://github.com/Nixtla/mlforecast/pull/37

    Full Changelog: https://github.com/Nixtla/mlforecast/compare/v0.2.0...v0.3.0

    Source code(tar.gz)
    Source code(zip)
Owner
Nixtla
Open Source Time Series Forecasting
Nixtla
A list of awesome PyTorch scholarship articles, guides, blogs, courses and other resources.

Awesome PyTorch Scholarship Resources A collection of awesome PyTorch and Python learning resources. Contributions are always welcome! Course Informat

Arnas Gečas 302 Dec 03, 2022
Surrogate- and Invariance-Boosted Contrastive Learning (SIB-CL)

Surrogate- and Invariance-Boosted Contrastive Learning (SIB-CL) This repository contains all source code used to generate the results in the article "

Charlotte Loh 3 Jul 23, 2022
Face Transformer for Recognition

Face-Transformer This is the code of Face Transformer for Recognition (https://arxiv.org/abs/2103.14803v2). Recently there has been great interests of

Zhong Yaoyao 153 Nov 30, 2022
Code and real data for the paper "Counterfactual Temporal Point Processes", available at arXiv.

counterfactual-tpp This is a repository containing code and real data for the paper Counterfactual Temporal Point Processes. Pre-requisites This code

Networks Learning 11 Dec 09, 2022
[ICCV 2021 (oral)] Planar Surface Reconstruction from Sparse Views

Planar Surface Reconstruction From Sparse Views Linyi Jin, Shengyi Qian, Andrew Owens, David F. Fouhey University of Michigan ICCV 2021 (Oral) This re

Linyi Jin 89 Jan 05, 2023
Quantization library for PyTorch. Support low-precision and mixed-precision quantization, with hardware implementation through TVM.

HAWQ: Hessian AWare Quantization HAWQ is an advanced quantization library written for PyTorch. HAWQ enables low-precision and mixed-precision uniform

Zhen Dong 293 Dec 30, 2022
Source code, datasets and trained models for the paper Learning Advanced Mathematical Computations from Examples (ICLR 2021), by François Charton, Amaury Hayat (ENPC-Rutgers) and Guillaume Lample

Maths from examples - Learning advanced mathematical computations from examples This is the source code and data sets relevant to the paper Learning a

Facebook Research 171 Nov 23, 2022
[CVPR 2021] Exemplar-Based Open-Set Panoptic Segmentation Network (EOPSN)

EOPSN: Exemplar-Based Open-Set Panoptic Segmentation Network (CVPR 2021) PyTorch implementation for EOPSN. We propose open-set panoptic segmentation t

Jaedong Hwang 49 Dec 30, 2022
AnimationKit: AI Upscaling & Interpolation using Real-ESRGAN+RIFE

ALPHA 2.5: Frostbite Revival (Released 12/23/21) Changelog: [ UI ] Chained design. All steps link to one another! Use the master override toggles to s

87 Nov 16, 2022
This is the official PyTorch implementation of our paper: "Artistic Style Transfer with Internal-external Learning and Contrastive Learning".

Artistic Style Transfer with Internal-external Learning and Contrastive Learning This is the official PyTorch implementation of our paper: "Artistic S

51 Dec 20, 2022
Aalto-cs-msc-theses - Listing of M.Sc. Theses of the Department of Computer Science at Aalto University

Aalto-CS-MSc-Theses Listing of M.Sc. Theses of the Department of Computer Scienc

Jorma Laaksonen 3 Jan 27, 2022
[NeurIPS 2021] Large Scale Learning on Non-Homophilous Graphs: New Benchmarks and Strong Simple Methods

Large Scale Learning on Non-Homophilous Graphs: New Benchmarks and Strong Simple Methods Large Scale Learning on Non-Homophilous Graphs: New Benchmark

60 Jan 03, 2023
NU-Wave: A Diffusion Probabilistic Model for Neural Audio Upsampling

NU-Wave: A Diffusion Probabilistic Model for Neural Audio Upsampling For Official repo of NU-Wave: A Diffusion Probabilistic Model for Neural Audio Up

Rishikesh (ऋषिकेश) 38 Oct 11, 2022
Official implementation of "Articulation Aware Canonical Surface Mapping"

Articulation-Aware Canonical Surface Mapping Nilesh Kulkarni, Abhinav Gupta, David F. Fouhey, Shubham Tulsiani Paper Project Page Requirements Python

Nilesh Kulkarni 56 Dec 16, 2022
KwaiRec: A Fully-observed Dataset for Recommender Systems (Density: Almost 100%)

KuaiRec: A Fully-observed Dataset for Recommender Systems (Density: Almost 100%) KuaiRec is a real-world dataset collected from the recommendation log

Chongming GAO (高崇铭) 70 Dec 28, 2022
Improving Contrastive Learning by Visualizing Feature Transformation, ICCV 2021 Oral

Improving Contrastive Learning by Visualizing Feature Transformation This project hosts the codes, models and visualization tools for the paper: Impro

Bingchen Zhao 83 Dec 15, 2022
🥇Samsung AI Challenge 2021 1등 솔루션입니다🥇

MoT - Molecular Transformer Large-scale Pretraining for Molecular Property Prediction Samsung AI Challenge for Scientific Discovery This repository is

Jungwoo Park 44 Dec 03, 2022
Easy way to add GoogleMaps to Flask applications. maintainer: @getcake

Flask Google Maps Easy to use Google Maps in your Flask application requires Jinja Flask A google api key get here Contribute To contribute with the p

Flask Extensions 611 Dec 05, 2022
From a body shape, infer the anatomic skeleton.

OSSO: Obtaining Skeletal Shape from Outside (CVPR 2022) This repository contains the official implementation of the skeleton inference from: OSSO: Obt

Marilyn Keller 166 Dec 28, 2022
交互式标注软件,暂定名 iann

iann 交互式标注软件,暂定名iann。 安装 按照官网介绍安装paddle。 安装其他依赖 pip install -r requirements.txt 运行 git clone https://github.com/PaddleCV-SIG/iann/ cd iann python iann

294 Dec 30, 2022