hgboost - Hyperoptimized Gradient Boosting

Overview

Star it if you like it!

hgboost is short for Hyperoptimized Gradient Boosting. It is a Python package for hyperparameter optimization of xgboost, catboost and lightboost models using cross-validation, with the results evaluated on an independent validation set. hgboost can be applied to classification and regression tasks.

hgboost is fun because:

1. Hyperoptimization of the parameter space using a Bayesian approach.
2. Determines the best scoring model(s) using k-fold cross-validation.
3. Evaluates the best model on an independent evaluation set.
4. Fits the model on the entire input data using the best model's parameters.
5. Works for classification and regression.
6. Creates a super-hyperoptimized model from an ensemble of all individually optimized models.
7. Returns the model, the search space and the test/evaluation results.
8. Makes insightful plots.

Documentation

  • Regression example (Colab notebook)
  • Classification example (Colab notebook)

Schematic overview of hgboost

Installation Environment

  • Install hgboost from PyPI (recommended). hgboost is compatible with Python 3.6+ and runs on Linux, macOS and Windows.
  • A new environment is recommended and can be created as follows:
conda create -n env_hgboost python=3.6
conda activate env_hgboost

Install the newest version of hgboost from PyPI

pip install hgboost

Force an upgrade to the latest version

pip install -U hgboost

Install from the GitHub source

pip install git+https://github.com/erdogant/hgboost#egg=master

Import hgboost package

import hgboost as hgboost

Classification example for xgboost, catboost and lightboost:

# Load library
from hgboost import hgboost

# Initialization
hgb = hgboost(max_eval=10, threshold=0.5, cv=5, test_size=0.2, val_size=0.2, top_cv_evals=10, random_state=42)
# Import data
df = hgb.import_example()
y = df['Survived'].values
y = y.astype(str)
y[y=='1']='survived'
y[y=='0']='dead'

# Preprocessing by encoding variables
del df['Survived']
X = hgb.preprocessing(df)
# Fit catboost by hyperoptimization and cross-validation
results = hgb.catboost(X, y, pos_label='survived')

# Fit lightboost by hyperoptimization and cross-validation
results = hgb.lightboost(X, y, pos_label='survived')

# Fit xgboost by hyperoptimization and cross-validation
results = hgb.xgboost(X, y, pos_label='survived')

# [hgboost] >Start hgboost classification..
# [hgboost] >Collecting xgb_clf parameters.
# [hgboost] >Number of variables in search space is [11], loss function: [auc].
# [hgboost] >method: xgb_clf
# [hgboost] >eval_metric: auc
# [hgboost] >greater_is_better: True
# [hgboost] >pos_label: True
# [hgboost] >Total dataset: (891, 204) 
# [hgboost] >Hyperparameter optimization..
#  100% |----| 500/500 [04:39<05:21,  1.33s/trial, best loss: -0.8800619834710744]
# [hgboost] >Best performing [xgb_clf] model: auc=0.881198
# [hgboost] >5-fold cross validation for the top 10 scoring models, Total nr. tests: 50
# 100%|██████████| 10/10 [00:42<00:00,  4.27s/it]
# [hgboost] >Evalute best [xgb_clf] model on independent validation dataset (179 samples, 20.00%).
# [hgboost] >[auc] on independent validation dataset: -0.832
# [hgboost] >Retrain [xgb_clf] on the entire dataset with the optimal parameters settings.
# Plot searched parameter space 
hgb.plot_params()

# Plot summary results
hgb.plot()

# Plot the best tree
hgb.treeplot()

# Plot the validation results
hgb.plot_validation()

# Plot the cross-validation results
hgb.plot_cv()

# use the learned model to make new predictions.
y_pred, y_proba = hgb.predict(X)

Create ensemble model for Classification

from hgboost import hgboost

hgb = hgboost(max_eval=100, threshold=0.5, cv=5, test_size=0.2, val_size=0.2, top_cv_evals=10, random_state=None, verbose=3)

# Import data
df = hgb.import_example()
y = df['Survived'].values
del df['Survived']
X = hgb.preprocessing(df, verbose=0)

results = hgb.ensemble(X, y, pos_label=1)

# use the predictor
y_pred, y_proba = hgb.predict(X)

Create ensemble model for Regression

import numpy as np
from hgboost import hgboost

hgb = hgboost(max_eval=100, threshold=0.5, cv=5, test_size=0.2, val_size=0.2, top_cv_evals=10, random_state=None, verbose=3)

# Import data
df = hgb.import_example()
y = df['Age'].values
del df['Age']
I = ~np.isnan(y)
X = hgb.preprocessing(df, verbose=0)
X = X.loc[I,:]
y = y[I]

results = hgb.ensemble(X, y, methods=['xgb_reg','ctb_reg','lgb_reg'])

# use the predictor
y_pred, y_proba = hgb.predict(X)

# Plot the ensemble validation results
hgb.plot_validation()

References

* http://hyperopt.github.io/hyperopt/
* https://github.com/dmlc/xgboost
* https://github.com/microsoft/LightGBM
* https://github.com/catboost/catboost

Maintainers

Contribute

  • Contributions are welcome.

Licence

See LICENSE for details.

Coffee

  • If you wish to buy me a coffee for this work, it is very much appreciated :)
Comments
  • import error during import hgboost

    When I finished installing hgboost and tried to import it, something went wrong. Could you please help me out? Details are as follows:

    ImportError                               Traceback (most recent call last)
    in
    ----> 1 from hgboost import hgboost

    C:\ProgramData\Anaconda3\lib\site-packages\hgboost\__init__.py in
    ----> 1 from hgboost.hgboost import hgboost
          2
          3 from hgboost.hgboost import (
          4     import_example,
          5 )

    C:\ProgramData\Anaconda3\lib\site-packages\hgboost\hgboost.py in
          9 import classeval as cle
         10 from df2onehot import df2onehot
    ---> 11 import treeplot as tree
         12 import colourmap
         13

    C:\ProgramData\Anaconda3\lib\site-packages\treeplot\__init__.py in
    ----> 1 from treeplot.treeplot import (
          2     plot,
          3     randomforest,
          4     xgboost,
          5     lgbm,

    C:\ProgramData\Anaconda3\lib\site-packages\treeplot\treeplot.py in
         14 import numpy as np
         15 from sklearn.tree import export_graphviz
    ---> 16 from sklearn.tree.export import export_text
         17 from subprocess import call
         18 import matplotlib.image as mpimg

    ImportError: cannot import name 'export_text' from 'sklearn.tree.export'

    thanks a lot!

    opened by recherHE 3
  • Test:Validation:Train split

    Shouldn't the new test-train split in def _HPOpt(self): be test_size=self.test_size/(1-self.val_size)? The shape of X has already been updated in _set_validation_set(self, X, y).

    I'm assuming that the test, train, and validation set ratios are defined on the original data.

    opened by SSLPP 3
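The arithmetic behind this question can be checked with a short sketch. It assumes, as the issue does, that test_size and val_size are fractions of the original data and that the validation set is held out first. The helper name nested_test_fraction and the row count are made up for illustration and are not part of hgboost.

```python
def nested_test_fraction(test_size: float, val_size: float) -> float:
    """Fraction of the data remaining after the validation hold-out that
    must go to the test set so that it equals `test_size` of the ORIGINAL data.

    Hypothetical helper for illustration; not part of the hgboost API.
    """
    if not 0 <= val_size < 1:
        raise ValueError("val_size must be in [0, 1)")
    if not 0 <= test_size <= 1 - val_size:
        raise ValueError("test_size must fit in the data left after validation")
    return test_size / (1 - val_size)


n_total = 1000                      # made-up number of rows
val_size, test_size = 0.2, 0.2      # same fractions as in the examples above

n_val = round(n_total * val_size)   # rows held out for validation
n_rest = n_total - n_val            # rows left for the train/test split

# Applying test_size directly to the remainder shrinks the test set.
naive_test = round(n_rest * test_size)

# Rescaling restores the intended share of the original data.
corrected = nested_test_fraction(test_size, val_size)
fixed_test = round(n_rest * corrected)

print(n_val, naive_test, fixed_test)
```

With these numbers the unscaled split yields a test set of 16% of the original rows, while the rescaled fraction of 0.25 yields the intended 20%.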
  • Treeplot failure - missing graphviz dependency

    I'm running through the example classification notebook now, and the treeplot fails to render, with the following warning:

    [Screenshot: 2022-10-04 at 14:30:21]

    It seems that graphviz, being a compiled C library, is not bundled with pip (it is included with conda install treeplot/graphviz, though).

    Since we have no recourse to add this to the pip requirements, maybe add a sentence to the installation instructions warning that graphviz must already be available and/or installed separately.

    (Note that the suggested apt command for Linux is not entirely necessary, because pydot does get installed with treeplot via pip.)

    opened by ninjit 2
  • Getting the native model for compatibility with shap.TreeExplainer

    Hello, first of all really nice project. I've just found out about it today and started playing with it a little bit. Is there any way to get the trained model as an XGBoost, LightGBM or CatBoost class in order to fit a shap.TreeExplainer instance to it?

    Thanks in advance! -Nicolás

    opened by nicolasaldecoa 2
  • Xgboost parameter

    After running hgb.plot_params(), the learning rate parameter is shown as 796, which does not seem reasonable. Can I see the model parameters that were selected by the hyperparameter optimization?

    [Screenshot: QQ screenshot 20210705184733]

    opened by LAH19999 2
  • HP Tuning: best_model uses different parameters from those that were reported as best ones

    I used hgboost for optimizing the hyper-parameters of my XGBoost model as described in the API References with the following parameters:

    hgb = hgboost()
    results = hgb.xgboost(X_train, y_train, pos_label=1, method='xgb_clf', eval_metric='logloss')
    

    As noted in the documentation, results is a dictionary that, among other things, returns the best performing parameters (best_params) and the best performing model (model). However, the parameters that the best performing model uses are different from what the function returns as best_params:

    best_params

    'params': {'colsample_bytree': 0.47000000000000003,
      'gamma': 1,
      'learning_rate': 534,
      'max_depth': 49,
      'min_child_weight': 3.0,
      'n_estimators': 36,
      'subsample': 0.96}
    

    model

    'model': XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
                   colsample_bynode=1, colsample_bytree=0.47000000000000003,
                   enable_categorical=False, gamma=1, gpu_id=-1,
                   importance_type=None, interaction_constraints='',
                   learning_rate=0.058619090164329916, max_delta_step=0,
                   max_depth=54, min_child_weight=3.0, missing=nan,
                   monotone_constraints='()', n_estimators=200, n_jobs=-1,
                   num_parallel_tree=1, predictor='auto', random_state=0,
                   reg_alpha=0, reg_lambda=1, scale_pos_weight=0.5769800646551724,
                   subsample=0.96, tree_method='exact', validate_parameters=1,
                   verbosity=0),
    

    As you can see, for example, max_depth=49 in the best_params, but the model uses max_depth=54 etc.

    Is this a bug or the intended behavior? In case of the latter, I'd really appreciate an explanation!

    My setup:

    • OS: WSL (Ubuntu)
    • Python: 3.9.7
    • hgboost: 1.0.0
    opened by Mikki99 1
  • Running regression example error

    opened by recherHE 1
  • Error in rmse calculation

    if self.eval_metric=='rmse':
                    loss = mean_squared_error(y_test, y_pred)
    

    mean_squared_error in sklearn returns the MSE; use mean_squared_error(y_true, y_pred, squared=False) to get the RMSE.

    opened by SSLPP 1
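The difference between the two metrics is easy to see on a small example. The sketch below uses plain Python and made-up numbers rather than hgboost's internals, and the function names are illustrative only.

```python
import math


def mse(y_true, y_pred):
    """Mean squared error: the average of the squared residuals."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean squared error: the square root of the MSE, in the units of y."""
    return math.sqrt(mse(y_true, y_pred))


y_test = [3.0, 5.0, 7.0, 9.0]    # made-up targets
y_pred = [1.0, 5.0, 7.0, 13.0]   # made-up predictions

mse_value = mse(y_test, y_pred)    # residuals 2, 0, 0, -4 give 20 / 4 = 5.0
rmse_value = rmse(y_test, y_pred)  # sqrt(5.0), roughly 2.236

print(mse_value, rmse_value)
```

On a single evaluation set the two metrics rank models identically, because the square root is monotonic, but the reported loss is on a different scale. Recent scikit-learn releases deprecate the squared argument in favour of a separate root_mean_squared_error function, so taking the square root explicitly, as above, works regardless of version.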
  • numpy.AxisError: axis 1 is out of bounds for array of dimension 1

    When eval_metric is auc, it raises an error. The related line is hgboost.py:906 and the related issue is: https://stackoverflow.com/questions/61288972/axiserror-axis-1-is-out-of-bounds-for-array-of-dimension-1-when-calculating-auc

    opened by quancore 0
  • ValueError: Target is multiclass but average='binary'. Please choose another average setting, one of [None, 'micro', 'macro', 'weighted'].

    An error is raised when the F1 score is used for multi-class classification. The failing line is hgboost.py:904, where the F1 score is calculated: the average parameter defaults to 'binary', which is not suitable for multi-class targets.

    opened by quancore 0
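To illustrate what a non-binary average setting computes, here is a minimal pure-Python sketch of macro-averaged F1 on made-up labels. It is not how hgboost or scikit-learn implement the metric, and the function names are illustrative. In scikit-learn, the corresponding call is f1_score(y_true, y_pred, average='macro').

```python
def f1_per_class(y_true, y_pred):
    """Return {label: F1}, treating each label in turn as the positive class."""
    scores = {}
    for label in sorted(set(y_true) | set(y_pred)):
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0.0 when the label never occurs
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores ('macro' averaging)."""
    scores = f1_per_class(y_true, y_pred)
    return sum(scores.values()) / len(scores)


y_true = ["a", "a", "b", "b", "c", "c"]   # made-up three-class targets
y_pred = ["a", "b", "b", "b", "c", "a"]   # made-up predictions

per_class = f1_per_class(y_true, y_pred)
macro = macro_f1(y_true, y_pred)

print(per_class, macro)
```

Macro averaging weights every class equally. Whether 'macro', 'micro' or 'weighted' is the right choice depends on class balance and on what the evaluation is meant to reward.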
Releases(1.1.3)