Stacked Generalization (Ensemble Learning)

Overview

Stacking (stacked generalization)

ikki407/stacking is a simple and useful stacking library written in Python.

You can use scikit-learn, XGBoost, and Keras models for stacking.
A notable feature of this library is that all out-of-fold predictions can be saved for further analysis after training.

Description

Stacking (sometimes called stacked generalization) involves training a learning algorithm to combine the predictions of several other learning algorithms. The basic idea is to train a pool of base classifiers, then use another classifier to combine their predictions, with the aim of reducing the generalization error.
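As a rough illustration of the idea, here is a minimal two-stage sketch written directly against scikit-learn rather than this library's API. The synthetic dataset, the choice of base models, and all variable names are assumptions made for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict, train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data (illustrative only).
X, y = make_classification(n_samples=400, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# A small pool of base classifiers (an arbitrary choice for this sketch).
base_models = [DecisionTreeClassifier(max_depth=3, random_state=0), GaussianNB()]

# Stage 1: out-of-fold probabilities on the training set become meta-features,
# so the combiner never sees a prediction made on a row the model was fit on.
train_meta = np.column_stack([
    cross_val_predict(m, X_train, y_train, cv=5, method="predict_proba")[:, 1]
    for m in base_models
])

# Refit each base model on all training rows to build the test meta-features.
test_meta = np.column_stack([
    m.fit(X_train, y_train).predict_proba(X_test)[:, 1] for m in base_models
])

# Stage 2: another classifier combines the base predictions.
meta_model = LogisticRegression()
meta_model.fit(train_meta, y_train)
stacked_accuracy = meta_model.score(test_meta, y_test)
```

Each column of train_meta holds one base model's out-of-fold prediction, which is the kind of output this library saves between stages.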

This blog post is very helpful for understanding stacking and ensemble learning.

Usage

See working example:

To run the examples, run sh run.sh. Note that:

  1. Place the train and test datasets under data/input.

  2. Features created from the original dataset need to be under data/output/features.

  3. Models for stacking are defined in scripts.py under the scripts folder.

  4. The created features need to be defined in that script.

  5. Run sh run.sh (python scripts/XXX.py).
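Before running, it can help to confirm that the folders named in the notes above exist. The following is a small sketch using only the standard library; the helper name missing_dirs is hypothetical and not part of the library, and the temporary directory only stands in for a real project checkout.

```python
import tempfile
from pathlib import Path

# Directory names taken from the notes above.
EXPECTED_DIRS = ("data/input", "data/output/features", "scripts")


def missing_dirs(project_root="."):
    """Return the expected directories that do not exist under project_root."""
    root = Path(project_root)
    return [d for d in EXPECTED_DIRS if not (root / d).is_dir()]


# Demonstration in a throwaway directory where only data/input exists.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "data" / "input").mkdir(parents=True)
    still_missing = missing_dirs(tmp)
```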

Detailed Usage

  1. Specify the training dataset, its target data, and the test dataset.

    FEATURE_LIST_stage1 = {
                    'train':(
                             INPUT_PATH + 'train.csv',
                             FEATURES_PATH + 'train_log.csv',
                            ),
    
                    'target':(
                             INPUT_PATH + 'target.csv',
                            ),
    
                    'test':(
                             INPUT_PATH + 'test.csv',
                             FEATURES_PATH + 'test_log.csv',
                            ),
                    }

  2. Define model classes that inherit from the BaseModel class; these are used in Stage 1, Stage 2, ..., Stage N.

    # For Stage 1
    PARAMS_V1 = {
            'colsample_bytree': 0.80,
            'learning_rate': 0.1,
            'eval_metric': 'auc',
            'max_depth': 5,
            'min_child_weight': 1,
            'nthread': 4,
            'objective': 'binary:logistic',
            'seed': 407,
            'silent': 1,
            'subsample': 0.60,
            }
    
    class ModelV1(BaseModel):
            def build_model(self):
                return XGBClassifier(params=self.params, num_round=10)
    
    ...
    
    # For Stage 2
    PARAMS_V1_stage2 = {
                        'penalty':'l2',
                        'tol':0.0001, 
                        'C':1.0, 
                        'random_state':None, 
                        'verbose':0, 
                        'n_jobs':8
                        }
    
    class ModelV1_stage2(BaseModel):
            def build_model(self):
                return LR(**self.params)

  3. Train each Stage 1 model for stacking.

    m = ModelV1(name="v1_stage1",
                flist=FEATURE_LIST_stage1,
                params = PARAMS_V1,
                kind = 'st'
                )
    m.run()
    
    ...

  4. Train the Stage 2 model(s) using the predictions of the Stage 1 models.

    FEATURE_LIST_stage2 = {
                'train': (
                         TEMP_PATH + 'v1_stage1_all_fold.csv',
                         TEMP_PATH + 'v2_stage1_all_fold.csv',
                         TEMP_PATH + 'v3_stage1_all_fold.csv',
                         TEMP_PATH + 'v4_stage1_all_fold.csv',
                         ...
                         ),
    
                'target':(
                         INPUT_PATH + 'target.csv',
                         ),
    
                'test': (
                        TEMP_PATH + 'v1_stage1_test.csv',
                        TEMP_PATH + 'v2_stage1_test.csv',
                        TEMP_PATH + 'v3_stage1_test.csv',
                        TEMP_PATH + 'v4_stage1_test.csv',
                        ...                     
                        ),
                }
    
    # Models
    m = ModelV1_stage2(name="v1_stage2",
                    flist=FEATURE_LIST_stage2,
                    params = PARAMS_V1_stage2,
                    kind = 'st',
                    )
    m.run()

  5. The final result is saved as v1_stage2_TestInAllTrainingData.csv.
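The Stage 2 feature list above names one prediction file per Stage 1 model. One plausible way to picture how such files become a single feature matrix is a column-wise join, sketched below with pandas. This is an assumption for illustration, not the library's actual loading code: the helper load_feature_block, the column names, and the values are all made up, and in-memory buffers stand in for the CSV files.

```python
import io

import pandas as pd


def load_feature_block(sources):
    """Read each CSV and join them column-wise, assuming aligned row order."""
    frames = [pd.read_csv(src) for src in sources]
    if len({len(frame) for frame in frames}) != 1:
        raise ValueError("all feature files must have the same number of rows")
    return pd.concat(frames, axis=1)


# In-memory stand-ins for two Stage 1 out-of-fold files (made-up values).
v1_file = io.StringIO("v1_stage1\n0.9\n0.2\n0.7\n")
v2_file = io.StringIO("v2_stage1\n0.8\n0.1\n0.4\n")

stage2_train = load_feature_block([v1_file, v2_file])
```

Each Stage 1 model contributes one column, so the Stage 2 model sees one row per training example and one feature per base model.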

Prerequisite

  • (macOS) Install xgboost manually first: pip install xgboost
  • (Optional) Install paratext, a fast CSV loading library

Installation

To install stacking, cd to the stacking folder and run the install command (up-to-date version, recommended):

sudo python setup.py install

You can also install stacking from PyPI:

pip install stacking

Files

Details of scripts

  • base.py:
    • Base models for stacking are defined here (using sklearn.base.BaseEstimator).
    • Some models are defined here, e.g., XGBoost, Keras, and Vowpal Wabbit.
    • These models are wrapped to be scikit-learn-like (using sklearn.base.ClassifierMixin and sklearn.base.RegressorMixin).
    • That is, each model class has the methods fit(), predict_proba(), and predict().

New user-defined models can be added here.

Scikit-learn models can be used.
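To make the scikit-learn-like interface concrete, here is a minimal sketch of a user-defined classifier exposing fit(), predict_proba(), and predict(). The class ClassPriorClassifier is a hypothetical toy that ignores the features and predicts class frequencies; it is not one of the wrappers shipped in base.py.

```python
import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin


class ClassPriorClassifier(ClassifierMixin, BaseEstimator):
    """Toy model: always predicts the class frequencies seen during fit()."""

    def fit(self, X, y):
        # Remember the distinct labels and how often each one occurred.
        self.classes_, counts = np.unique(y, return_counts=True)
        self.priors_ = counts / counts.sum()
        return self

    def predict_proba(self, X):
        # One identical row of class probabilities per input row.
        return np.tile(self.priors_, (len(X), 1))

    def predict(self, X):
        # Pick the most probable class for each row.
        return self.classes_[np.argmax(self.predict_proba(X), axis=1)]


X_demo = np.zeros((4, 2))
y_demo = np.array([0, 0, 0, 1])
model = ClassPriorClassifier().fit(X_demo, y_demo)
demo_proba = model.predict_proba(X_demo)
demo_pred = model.predict(X_demo)
```

Any object with these three methods can, in principle, be returned from a build_model() method like the ones shown in Detailed Usage.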

The base model's kind argument accepts the following values:

  • 's': Stacking. Saves the out-of-fold (OOF) predictions ({model_name}_all_fold.csv) and the average of the test predictions from the models trained on each fold ({model_name}_test.csv). These files are used for the next level of stacking.

  • 't': Training on all data and predicting the test set ({model_name}_TestInAllTrainingData.csv). No validation data are used in this training.

  • 'st': Stacking, then training on all data and predicting the test set ('s' and 't').

  • 'cv': Cross-validation only, without saving the predictions.
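The 's' behavior described above can be sketched as follows. This is an illustrative reimplementation with scikit-learn, not the library's code: the function name stack_one_model, the fold settings, and the synthetic data are assumptions, and the sketch returns arrays instead of writing CSV files.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold


def stack_one_model(model, X_train, y_train, X_test, n_splits=5, seed=0):
    """Return (oof_pred, test_pred) in the spirit of kind='s'."""
    oof_pred = np.zeros(len(X_train))
    test_pred = np.zeros(len(X_test))
    folds = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for fit_idx, val_idx in folds.split(X_train, y_train):
        fold_model = clone(model).fit(X_train[fit_idx], y_train[fit_idx])
        # Each training row is predicted by the one model that did not see it.
        oof_pred[val_idx] = fold_model.predict_proba(X_train[val_idx])[:, 1]
        # Test predictions are averaged over the fold models.
        test_pred += fold_model.predict_proba(X_test)[:, 1] / n_splits
    return oof_pred, test_pred


X, y = make_classification(n_samples=300, n_features=8, random_state=1)
X_train, y_train, X_test = X[:200], y[:200], X[200:]
oof_pred, test_pred = stack_one_model(
    LogisticRegression(max_iter=1000), X_train, y_train, X_test
)
```

Here oof_pred plays the role of {model_name}_all_fold.csv and test_pred the role of {model_name}_test.csv.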

In each script, define the models used for stacking and their parameters, with the task details at the top of the script. The train and test feature sets are also defined there, and a CV-fold index needs to be defined.
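A CV-fold index is simply an assignment of each training row to a fold. A minimal standard-library sketch is shown below; the helper names make_fold_index and iter_folds and the seed value are hypothetical, and the format the library actually expects for its fold index may differ.

```python
import random


def make_fold_index(n_rows, n_folds=5, seed=407):
    """Assign every row a fold id in [0, n_folds), balanced and shuffled."""
    fold_ids = [i % n_folds for i in range(n_rows)]
    random.Random(seed).shuffle(fold_ids)
    return fold_ids


def iter_folds(fold_ids):
    """Yield (train_idx, valid_idx) row-index lists, one pair per fold."""
    for k in sorted(set(fold_ids)):
        train_idx = [i for i, f in enumerate(fold_ids) if f != k]
        valid_idx = [i for i, f in enumerate(fold_ids) if f == k]
        yield train_idx, valid_idx


fold_ids = make_fold_index(10, n_folds=5)
splits = list(iter_folds(fold_ids))
```

Fixing the fold index once and reusing it for every Stage 1 model keeps the out-of-fold predictions of different models aligned row by row.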

Stacking with any number of levels can be defined.

PredictionFiles

Reference

[1] Wolpert, David H. Stacked generalization, Neural Networks, 5(2), 241-259

[2] Ensemble learning(Stacking)

[3] KAGGLE ENSEMBLING GUIDE

Owner
Ikki Tanaka
Data Scientist, Machine Learning/Reinforcement Learning Engineer. Kaggle Master.