Time Series Cross-Validation -- an extension for scikit-learn

Overview

Downloads Build Status codecov DOI

TSCV: Time Series Cross-Validation

This repository is a scikit-learn extension for time series cross-validation. It introduces gaps between the training set and the test set, which mitigates the temporal dependence of time series and prevents information leakage.

Installation

pip install tscv

or

conda install -c conda-forge tscv

Usage

This extension defines 3 cross-validator classes and 1 function:

  • GapLeavePOut
  • GapKFold
  • GapRollForward
  • gap_train_test_split

The three classes can all be passed, as the cv argument, to scikit-learn functions such as cross-validate, cross_val_score, and cross_val_predict, just like the native cross-validator classes.

The one function is an alternative to the train_test_split function in scikit-learn.

Examples

The following example uses GapKFold instead of KFold as the cross-validator.

import numpy as np
from sklearn import datasets
from sklearn import svm
from sklearn.model_selection import cross_val_score
from tscv import GapKFold

iris = datasets.load_iris()
clf = svm.SVC(kernel='linear', C=1)

# use GapKFold as the cross-validator
cv = GapKFold(n_splits=5, gap_before=5, gap_after=5)
scores = cross_val_score(clf, iris.data, iris.target, cv=cv)

The following example uses gap_train_test_split to split the data set into the training set and the test set.

import numpy as np
from tscv import gap_train_test_split

X, y = np.arange(20).reshape((10, 2)), np.arange(10)
X_train, X_test, y_train, y_test = gap_train_test_split(X, y, test_size=2, gap_size=2)

Contributing

  • Report bugs in the issue tracker
  • Express your use cases in the issue tracker

Documentations

Acknowledgments

  • I would like to thank Jeffrey Racine and Christoph Bergmeir for the helpful discussion.

License

BSD-3-Clause

Citation

Wenjie Zheng. (2021). Time Series Cross-Validation (TSCV): an extension for scikit-learn. Zenodo. http://doi.org/10.5281/zenodo.4707309

@software{zheng_2021_4707309,
  title={{Time Series Cross-Validation (TSCV): an extension for scikit-learn}},
  author={Zheng, Wenjie},
  month={april},
  year={2021},
  publisher={Zenodo},
  doi={10.5281/zenodo.4707309},
  url={http://doi.org/10.5281/zenodo.4707309}
}
Comments
  • Make it work with cross_val_predict

    Make it work with cross_val_predict

    Is it possible to somehow make the CV work with cross_val_predict function. Fore example, if I try:

    cv = GapWalkForward(n_splits=3, gap_size=1, test_size=2)
    cross_val_predict(estimator=SGDClassifier(), X=X_sample, y=y_bin_sample, cv=cv, n_jobs=6)
    

    it returns an error

    ValueError: cross_val_predict only works for partitions

    but I would like to have predictions so I can make consfusion matrx and other statistics.

    Is it possible to make it work with your cross-validators?

    opened by MislavSag 8
  • Documentation

    Documentation

    Documentation and examples do not address the splitting of data set into training and test sets.

    If using one of the cross validators, does the data set need to be sorted in time order? Is there way to designate a datetime column so the class understands on what basis to sequentially split data?

    opened by mksamelson 3
  • split.py depends on deprecated / newly private method `_safe_indexing` in scikit-learn 0.24.0

    split.py depends on deprecated / newly private method `_safe_indexing` in scikit-learn 0.24.0

    Just flagging a minor issue:

    We found this after poetry update-ing our dependencies, inadvertently bumping scikit-learn to 0.24.0. This broke code we have that uses tscv

    relevant scikit-learn source-code from version 0.23.0 https://github.com/scikit-learn/scikit-learn/blob/0.23.0/sklearn/utils/init.py#L274-L275

    The method has been made private in scikit-learn 0.24.0: https://github.com/scikit-learn/scikit-learn/blob/0.24.0/sklearn/utils/init.py#L271

    I did not investigate further, we pinned scikit-learn to 0.23.0 and that's OK for now, but some refactoring may be in order to move off the private method.

    opened by rob-sokolowski 3
  • Error when Importing TSCV Gapwalkforward

    Error when Importing TSCV Gapwalkforward

    Using TSCV Gapwalkforward successfully with Python 3.7.

    Suddenly getting following error:

    ImportError Traceback (most recent call last) in 41 #Modeling 42 ---> 43 from tscv import GapWalkForward 44 from sklearn.utils import shuffle 45 from sklearn.model_selection import KFold

    ~\Anaconda3\envs\py37\lib\site-packages\tscv_init_.py in ----> 1 from .split import GapCrossValidator 2 from .split import GapLeavePOut 3 from .split import GapKFold 4 from .split import GapWalkForward 5 from .split import gap_train_test_split

    ~\Anaconda3\envs\py37\lib\site-packages\tscv\split.py in 7 8 import numpy as np ----> 9 from sklearn.utils import indexable, safe_indexing 10 from sklearn.utils.validation import _num_samples 11 from sklearn.base import _pprint

    ImportError: cannot import name 'safe_indexing' from 'sklearn.utils'

    Any insight? I get this when simply importing Gapwalkforward.

    opened by mksamelson 2
  • GapWalkForward Issue with Scikit-learn 0.24.1

    GapWalkForward Issue with Scikit-learn 0.24.1

    When I upgrade to Scikit-learn 0.24.1 I get an issue:

    cannot import name 'safe_indexing' from 'sklearn.utils'

    This appears to be a change within scikit-learn as indicated here:

    https://stackoverflow.com/questions/65602076/yellowbrick-importerror-cannot-import-name-safe-indexing-from-sklearn-utils

    No issue using scikit-learn 0.23.2

    opened by mksamelson 2
  • Release 0.0.4 for GridSearch compat

    Release 0.0.4 for GridSearch compat

    Would it be possible to issue a new release on PyPI to include the latest changes from this commit which aligns the get_n_splits method signature with the abstract method signature required by GridSearchCV?

    opened by wderose 2
  • Warning once is not enough

    Warning once is not enough

    https://github.com/WenjieZ/TSCV/blob/f8b832fab1dca0e2d2d46029308c2d06eef8b858/tscv/split.py#L253

    This warning should appear for every occurrence. Use standard output instead.

    opened by WenjieZ 1
  • Retrained version of GapWalkForward: GapRollForward

    Retrained version of GapWalkForward: GapRollForward

    The current implementation is based on legacy K-Fold cross-validation requiring an explicit value for the n_splits parameter. It puts the burden of calculating desired value of n_splits on the user.

    A better implementation should allow the user to initiate a GapWalkForward class without specifying the value for n_splits. Instead, it can deduct the right value through the other inputs.

    It is theoretically desirable to keep both channels of kickstarting a GapWalkForward class. In practice, however, it is hard to maintain both within a single class. Therefore, I decide to ~~deprecate the n_splits channel~~ implement a new class dubbed GapRollForward in v0.1.0 -- the version after the next.

    opened by WenjieZ 1
  • Changed GapWalkForward.get_n_splits to match abstract method signatur…

    Changed GapWalkForward.get_n_splits to match abstract method signatur…

    …e. Now works with GridSearchCV. Otherwise using GapWalkForward as the cross validation class passed to GridSearchCV will fail with "TypeError: get_n_splits() takes 1 positional argument but 4 were given."

    opened by lawsonmcw 1
  • Import error with latest sklearn version

    Import error with latest sklearn version

    Hi guys, this issue occured after the upgrade to 1.1.3

    ImportError: cannot import name '_pprint' from 'sklearn.base'

    /.venv/lib/python3.10/site-packages/tscv/_split.py:19 in      │
    │ <module>                                                                                         │
    │                                                                                                  │
    │    16 import numpy as np                                                                         │
    │    17 from sklearn.utils import indexable                                                        │
    │    18 from sklearn.utils.validation import _num_samples, check_consistent_length                 │
    │ ❱  19 from sklearn.base import _pprint                                                           │
    │    20 from sklearn.utils import _safe_indexing                                                   │
    │    21                                                                                            │
    │    22                                       
    

    Could you please fix it ?

    Kind regards, Jim

    opened by teneon 1
  • Consistently use the test sets as reference for `gap_before` and `gap_after`

    Consistently use the test sets as reference for `gap_before` and `gap_after`

    There are two ways of defining a derived cross-validator. One is to redefine _iter_test_indices or _iter_test_masks (test viewpoint), and the other is to redefine _iter_train_masks or _iter_train_indices (train viewpoint).

    Currently, these two methods assign different semantic meanings to the parameters gap_before and gap_after. The test viewpoint uses the test sets as the reference:

    train    gap_before    test    gap_after    train
    

    The train viewpoint uses the training sets as the reference:

    test    gap_before    train    gap_after    test
    

    This diverged behavior is ~~not intended~~ inappropriate. The package should insist on the test viewpoint, and hence this PR. It will be enforced in v0.2.

    I don't think this issue has touched any users, for the derived classes in this package use _iter_test_indices exclusively (test viewpoint). No users have reported this issue either. If you suspect that you have been affected by it, please reply to this PR.

    opened by WenjieZ 1
  • time boost in folds generation

    time boost in folds generation

    With contiguous test sets:

    cv_orig = GapKFold(n_splits=5, gap_before=1, gap_after=1)
    
    for train_index, test_index in cv_orig.split(np.arange(10)):
        print("TRAIN:", train_index, "TEST:", test_index)
    
    
    ... TRAIN: [3 4 5 6 7 8 9] TEST: [0 1]
    ... TRAIN: [0 5 6 7 8 9] TEST: [2 3]
    ... TRAIN: [0 1 2 7 8 9] TEST: [4 5]
    ... TRAIN: [0 1 2 3 4 9] TEST: [6 7]
    ... TRAIN: [0 1 2 3 4 5 6] TEST: [8 9]
    
    cv_opt = GapKFold(n_splits=5, gap_before=1, gap_after=1)
    
    for train_index, test_index in cv_opt.split(np.arange(10)):
        print("TRAIN:", train_index, "TEST:", test_index)
    
    
    ... TRAIN: [3 4 5 6 7 8 9] TEST: [0 1]
    ... TRAIN: [0 5 6 7 8 9] TEST: [2 3]
    ... TRAIN: [0 1 2 7 8 9] TEST: [4 5]
    ... TRAIN: [0 1 2 3 4 9] TEST: [6 7]
    ... TRAIN: [0 1 2 3 4 5 6] TEST: [8 9]
    
    %%timeit
    folds = list(cv_orig.split(np.arange(10000)))
    
    
    ... 1.21 s ± 37.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
    
    %%timeit
    folds = list(cv_opt.split(np.arange(10000)))
    
    
    ... 4.74 ms ± 44.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
    

    With uncontiguous test sets:

    cv_orig = _XXX_(_xxx_, gap_before=1, gap_after=1)
    
    for train_index, test_index in cv_orig.split(np.arange(10)):
        print("TRAIN:", train_index, "TEST:", test_index)
    
    
    ... TRAIN: [5 6 7 8 9] TEST: [0 1 2 3]
    ... TRAIN: [7 8 9] TEST: [0 1 4 5]
    ... TRAIN: [3 4 9] TEST: [0 1 6 7]
    ... TRAIN: [3 4 5 6] TEST: [0 1 8 9]
    ... TRAIN: [0 7 8 9] TEST: [2 3 4 5]
    ... TRAIN: [0 9] TEST: [2 3 6 7]
    ... TRAIN: [0 5 6] TEST: [2 3 8 9]
    ... TRAIN: [0 1 2 9] TEST: [4 5 6 7]
    ... TRAIN: [0 1 2] TEST: [4 5 8 9]
    ... TRAIN: [0 1 2 3 4] TEST: [6 7 8 9]
    
    cv_opt = _XXX_(_xxx_, gap_before=1, gap_after=1)
    
    for train_index, test_index in cv_opt.split(np.arange(10)):
        print("TRAIN:", train_index, "TEST:", test_index)
    
    
    ... TRAIN: [5 6 7 8 9] TEST: [0 1 2 3]
    ... TRAIN: [7 8 9] TEST: [0 1 4 5]
    ... TRAIN: [3 4 9] TEST: [0 1 6 7]
    ... TRAIN: [3 4 5 6] TEST: [0 1 8 9]
    ... TRAIN: [0 7 8 9] TEST: [2 3 4 5]
    ... TRAIN: [0 9] TEST: [2 3 6 7]
    ... TRAIN: [0 5 6] TEST: [2 3 8 9]
    ... TRAIN: [0 1 2 9] TEST: [4 5 6 7]
    ... TRAIN: [0 1 2] TEST: [4 5 8 9]
    ... TRAIN: [0 1 2 3 4] TEST: [6 7 8 9]
    
    %%timeit
    folds = list(cv_orig.split(np.arange(10000)))
    
    ... 1.23 s ± 75.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
    
    %%timeit
    folds = list(cv_opt.split(np.arange(10000)))
    
    ... 4.78 ms ± 49.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
    
    opened by aldder 3
  • add CombinatorialGapKFold

    add CombinatorialGapKFold

    From "Advances in Financial Machine Learning" book by Marcos López de Prado the implemented version of Combinatorial Cross Validation with Purging and Embargoing

    image

    explaining video: https://www.youtube.com/watch?v=hDQssGntmFA

    opened by aldder 3
  • Implement Rep-Holdout

    Implement Rep-Holdout

    Thank you for this repository and the implemented CV-methods; especially GapRollForward. I was looking for exactly this package.

    I was wondering if you are interested in implementing another CV-Method for time series, called Rep-Holdout. It is used in this evaluation paper (https://arxiv.org/abs/1905.11744) and has good performance compared to all other CV-methods - some of which you have implemented here.

    As I understand it, it is somewhat like sklearn.model_selection.TimeSeriesSplit but with a randomized selection of all possible folds. Here is the description from the paper as an image:

    Unbenannt


    The authors provided code in R but it is written very differently than how it needs to look in Python. I adapted your functions to implement it in python but I am not the best coder and it really only serves my purpose of tuning a specific model. Seeing as the performance of Rep-Holdout is good and -to me at least - it makes sense for time series cross validation, maybe you are interested in adding this function to your package?

    opened by georgeblck 8
  • Intution on setting number of gaps

    Intution on setting number of gaps

    If for example, I have data without gaps, when and why would I still create a break between my train and validation? I have seen the argument for setting gaps when the period that needs to be predicted may be N days after the train. Are there other reasons? And if so, what is the intuition on knowing how many gaps to include before/after the training set?

    opened by tyokota 0
Releases(v0.1.2)
Owner
Wenjie Zheng
Statistical Learning Solution Expert
Wenjie Zheng
A Python framework for conversational search

Chatty Goose Multi-stage Conversational Passage Retrieval: An Approach to Fusing Term Importance Estimation and Neural Query Rewriting Installation Ma

Castorini 36 Oct 23, 2022
Experiments on continual learning from a stream of pretrained models.

Ex-model CL Ex-model continual learning is a setting where a stream of experts (i.e. model's parameters) is available and a CL model learns from them

Antonio Carta 6 Dec 04, 2022
Segment axon and myelin from microscopy data using deep learning

Segment axon and myelin from microscopy data using deep learning. Written in Python. Using the TensorFlow framework. Based on a convolutional neural network architecture. Pixels are classified as eit

NeuroPoly 103 Nov 29, 2022
[ICLR 2022] Pretraining Text Encoders with Adversarial Mixture of Training Signal Generators

AMOS This repository contains the scripts for fine-tuning AMOS pretrained models on GLUE and SQuAD 2.0 benchmarks. Paper: Pretraining Text Encoders wi

Microsoft 22 Sep 15, 2022
Google AI Open Images - Object Detection Track: Open Solution

Google AI Open Images - Object Detection Track: Open Solution This is an open solution to the Google AI Open Images - Object Detection Track 😃 More c

minerva.ml 46 Jun 22, 2022
Ansible Automation Example: JSNAPY PRE/POST Upgrade Validation

Ansible Automation Example: JSNAPY PRE/POST Upgrade Validation Overview This example will show how to validate the status of our firewall before and a

Calvin Remsburg 1 Jan 07, 2022
The dataset of tweets pulling from Twitters with keyword: Hydroxychloroquine, location: US, Time: 2020

HCQ_Tweet_Dataset: FREE to Download. Keywords: HCQ, hydroxychloroquine, tweet, twitter, COVID-19 This dataset is associated with the paper "Understand

2 Mar 16, 2022
Code for the paper "Query Embedding on Hyper-relational Knowledge Graphs"

Query Embedding on Hyper-Relational Knowledge Graphs This repository contains the code used for the experiments in the paper Query Embedding on Hyper-

DimitrisAlivas 19 Jul 26, 2022
A FAIR dataset of TCV experimental results for validating edge/divertor turbulence models.

TCV-X21 validation for divertor turbulence simulations Quick links Intro Welcome to TCV-X21. We're glad you've found us! This repository is designed t

0 Dec 18, 2021
AlphaBot2 Pi Core software for interfacing with the various components.

AlphaBot2-Pi-Core AlphaBot2 Pi Core software for interfacing with the various components. This project is currently a W.I.P. I will update this readme

KyleDev 1 Feb 13, 2022
Change is Everywhere: Single-Temporal Supervised Object Change Detection in Remote Sensing Imagery (ICCV 2021)

Change is Everywhere Single-Temporal Supervised Object Change Detection in Remote Sensing Imagery by Zhuo Zheng, Ailong Ma, Liangpei Zhang and Yanfei

Zhuo Zheng 125 Dec 13, 2022
SPCL: A New Framework for Domain Adaptive Semantic Segmentation via Semantic Prototype-based Contrastive Learning

SPCL SPCL: A New Framework for Domain Adaptive Semantic Segmentation via Semantic Prototype-based Contrastive Learning Update on 2021/11/25: ArXiv Ver

Binhui Xie (谢斌辉) 11 Oct 29, 2022
Contenido del curso Bases de datos del DCC PUC versión 2021-2

IIC2413 - Bases de Datos Tabla de contenidos Equipo Profesores Ayudantes Contenidos Calendario Evaluaciones Resumen de notas Foro Política de integrid

54 Nov 23, 2022
Code for NeurIPS 2020 article "Contrastive learning of global and local features for medical image segmentation with limited annotations"

Contrastive learning of global and local features for medical image segmentation with limited annotations The code is for the article "Contrastive lea

Krishna Chaitanya 152 Dec 22, 2022
Implementation of "Meta-rPPG: Remote Heart Rate Estimation Using a Transductive Meta-Learner"

Meta-rPPG: Remote Heart Rate Estimation Using a Transductive Meta-Learner This repository is the official implementation of Meta-rPPG: Remote Heart Ra

Eugene Lee 137 Dec 13, 2022
All of the figures and notebooks for my deep learning book, for free!

"Deep Learning - A Visual Approach" by Andrew Glassner This is the official repo for my book from No Starch Press. Ordering the book My book is called

Andrew Glassner 227 Jan 04, 2023
MarcoPolo is a clustering-free approach to the exploration of bimodally expressed genes along with group information in single-cell RNA-seq data

MarcoPolo is a method to discover differentially expressed genes in single-cell RNA-seq data without depending on prior clustering Overview MarcoPolo

Chanwoo Kim 13 Dec 18, 2022
PyTorch implementation of the REMIND method from our ECCV-2020 paper "REMIND Your Neural Network to Prevent Catastrophic Forgetting"

REMIND Your Neural Network to Prevent Catastrophic Forgetting This is a PyTorch implementation of the REMIND algorithm from our ECCV-2020 paper. An ar

Tyler Hayes 72 Nov 27, 2022
Credit fraud detection in Python using a Jupyter Notebook

Credit-Fraud-Detection - Credit fraud detection in Python using a Jupyter Notebook , using three classification models (Random Forest, Gaussian Naive Bayes, Logistic Regression) from the sklearn libr

Ali Akram 4 Dec 28, 2021
Densely Connected Search Space for More Flexible Neural Architecture Search (CVPR2020)

DenseNAS The code of the CVPR2020 paper Densely Connected Search Space for More Flexible Neural Architecture Search. Neural architecture search (NAS)

Jamin Fong 291 Nov 18, 2022