Bonsai: Gradient Boosted Trees + Bayesian Optimization

Last update: Oct 27, 2022


Bonsai is a wrapper for the XGBoost and CatBoost model training pipelines that uses Bayesian optimization for computationally efficient hyperparameter tuning.

Despite being a very small package, it exposes nearly all of the configurable parameters in XGBoost and CatBoost, as well as those of the BayesianOptimization package, allowing users to specify their own objectives, metrics, parameter search ranges, and search policies. This is possible because the two boosting libraries have very similar interfaces.

$ pip install bonsai-tree

References/Dependencies:

Why use Bonsai?

Grid search and random search are the most commonly used algorithms for exploring the hyperparameter space of a wide range of machine learning models. While effective for optimizing over low-dimensional hyperparameter spaces (e.g., a few regularization terms), these methods do not scale well to models with a large number of hyperparameters, such as gradient boosted trees.
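To make the scaling problem concrete, the following sketch counts how many configurations a full grid requires. The grid values are illustrative assumptions, not recommendations, and each configuration would cost one full cross-validated training run.

```python
import itertools

# Hypothetical grid: three candidate values per hyperparameter
grid = {
    "eta": [0.15, 0.25, 0.4],
    "n_estimators": [200, 1100, 2000],
    "max_depth": [4, 6, 8],
}

# Every combination is a separate cross-validated training run
combos = list(itertools.product(*grid.values()))
print(len(combos))


def grid_size(values_per_param, n_params):
    """Number of runs for a full grid with equal resolution per parameter."""
    return values_per_param ** n_params


# The run count grows exponentially with the number of hyperparameters
for n_params in (3, 6, 9):
    print(n_params, grid_size(3, n_params))
```

With three values per parameter, three hyperparameters need 27 runs, while nine hyperparameters need 19,683.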

Bayesian optimization, on the other hand, dynamically samples from the hyperparameter space with the goal of minimizing uncertainty about the underlying objective function. For model optimization, this consists of placing a prior distribution over functions on the hyperparameter space (a Gaussian process), updating it after each evaluation, and sampling with the goal of minimizing the posterior variance of the loss surface.
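The idea can be illustrated with a toy one-dimensional example. The sketch below is not Bonsai's or BayesianOptimization's implementation: it fits a Gaussian process with an assumed RBF kernel to a made-up loss function (toy_loss) and always evaluates the candidate with the largest posterior variance. In practice, the optimizer picks points with an acquisition function that also weighs the predicted value, not the variance alone.

```python
import math


def rbf(a, b, length_scale=0.3):
    """Squared-exponential kernel between two scalar inputs."""
    return math.exp(-0.5 * ((a - b) / length_scale) ** 2)


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x


def posterior(xs, ys, x_new, noise=1e-6):
    """GP posterior mean and variance at x_new given observations (xs, ys)."""
    K = [[rbf(a, b) + (noise if i == j else 0.0) for j, b in enumerate(xs)]
         for i, a in enumerate(xs)]
    k_star = [rbf(a, x_new) for a in xs]
    alpha = solve(K, ys)       # K^-1 y
    v = solve(K, k_star)       # K^-1 k_star
    mean = sum(k * a for k, a in zip(k_star, alpha))
    var = rbf(x_new, x_new) - sum(k * w for k, w in zip(k_star, v))
    return mean, max(var, 0.0)  # clip tiny negative values from round-off


def toy_loss(x):
    """Made-up stand-in for a cross-validated loss over one hyperparameter."""
    return (x - 0.6) ** 2


candidates = [i / 20 for i in range(21)]  # 0.0, 0.05, ..., 1.0
xs = [0.0, 1.0]                           # two initial evaluations
ys = [toy_loss(x) for x in xs]

for _ in range(5):
    # evaluate next where the surrogate is least certain
    x_next = max(candidates, key=lambda c: posterior(xs, ys, c)[1])
    xs.append(x_next)
    ys.append(toy_loss(x_next))

print(xs)
```

Each new evaluation lands in the largest unexplored gap, so the sampled points spread across the range without a fixed grid being specified in advance.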

Model Configuration

Since Bonsai is simply a wrapper for both XGBoost and CatBoost, the model_params dict is synonymous with the params argument of both catboost.train() and xgboost.train(). Additionally, you must encode your categorical features as usual for the library you are using (XGBoost: one-hot encoding, CatBoost: label encoding).
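As a rough illustration of the two encodings, the sketch below uses pandas on a made-up frame. The column names and data are hypothetical, and any equivalent encoder would work just as well.

```python
import pandas as pd

# Hypothetical training frame with one categorical feature
train = pd.DataFrame({
    "color": ["red", "blue", "red", "green"],
    "size": [1.0, 2.5, 3.0, 0.5],
    "label": [0, 1, 0, 1],
})
cats = ["color"]

# CatBoost path: integer label codes, one column per categorical feature
train_cb = train.copy()
for col in cats:
    train_cb[col] = train_cb[col].astype("category").cat.codes

# XGBoost path: one indicator column per category level
train_xgb = pd.get_dummies(train, columns=cats)

print(train_cb["color"].tolist())
print(list(train_xgb.columns))
```

Label encoding keeps the frame the same width, while one-hot encoding adds one column per category level.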

Below is a simple example of binary classification using CatBoost:

# label encoded training data
# ('train' is a pandas DataFrame, 'target' is the name of the label column)
X = train.drop(target, axis = 1)
y = train[target]

# same args as catboost.train(...)
model_params = dict(objective = 'Logloss', verbose = False)

# same args as catboost.cv(...)
cv_params = dict(nfold = 5)

The pbounds dict, as seen below, specifies the hyperparameter bounds over which the optimizer will search. Additionally, the opt_config dictionary configures the optimizer itself. Refer to the BayesianOptimization documentation to learn more.

# defining parameter search ranges
pbounds = dict(
  eta = (0.15, 0.4),
  n_estimators = (200, 2000),
  max_depth = (4, 8)
)

# 10 warm up samples + 10 optimizing steps
n_iter, init_points = 10, 10

# to learn more about customizing your search policy:
# BayesianOptimization/examples/exploitation_vs_exploration.ipynb
opt_config = dict(acq = 'ei', xi = 1e-2)
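For intuition about the acq = 'ei' and xi settings, here is a minimal sketch of the expected improvement acquisition function in its usual closed form for a Gaussian posterior. It is written from the standard formula for a maximization problem, not copied from BayesianOptimization, and the function name and arguments are illustrative.

```python
import math


def expected_improvement(mu, sigma, best, xi=0.01):
    """Expected value of max(f - best - xi, 0) when f ~ N(mu, sigma^2)."""
    gain = mu - best - xi
    if sigma <= 0.0:
        # no uncertainty left: the improvement is deterministic
        return max(gain, 0.0)
    z = gain / sigma
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    return gain * cdf + sigma * pdf


# Two hypothetical candidates scored against a best observed value of 0.82:
# one with a slightly better mean but little uncertainty, one uncertain.
confident = dict(mu=0.83, sigma=0.01)
uncertain = dict(mu=0.80, sigma=0.10)

for xi in (0.0, 0.01, 0.1):
    ei_confident = expected_improvement(best=0.82, xi=xi, **confident)
    ei_uncertain = expected_improvement(best=0.82, xi=xi, **uncertain)
    print(xi, ei_confident, ei_uncertain)
```

A larger xi demands a bigger gain over the best score so far, which lowers the score of candidates that are only marginally better and shifts the search toward uncertain regions, that is, toward exploration.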

Tuning and Prediction

All that is left is to initialize and optimize.

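from bayes_opt import SequentialDomainReductionTransformer
from bonsai.tune import CB_Tuner

# 'bounds_transformer' is not defined elsewhere in this README; as an
# assumption, BayesianOptimization's domain reduction transformer is used here
bounds_transformer = SequentialDomainReductionTransformer()

# note that 'cats' is a list of categorical feature names
tuner = CB_Tuner(X, y, cats, model_params, cv_params, pbounds)
tuner.optimize(n_iter, init_points, opt_config, bounds_transformer)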

After the optimal parameters are found, the model is trained and stored internally, giving full access to the CatBoost model.

import catboost

test_pool = catboost.Pool(test, cat_features = cats)
preds = tuner.model.predict(test_pool, prediction_type = 'Probability')

Bonsai also comes with a parallel coordinates plotting utility that helps users further narrow down their parameter search ranges as needed.

from bonsai.utils import parallel_coordinates

# DataFrame with hyperparams and observed loss
results = tuner.opt_results
parallel_coordinates(results)
