Multi-objective Optimized GBT (MooGBT)

MooGBT is a library for multi-objective optimization in Gradient Boosted Trees. MooGBT optimizes for multiple objectives by defining constraints on one or more sub-objectives alongside a primary objective. The constraints are defined as upper bounds on the sub-objective loss functions. MooGBT uses an Augmented Lagrangian (AL) based constrained-optimization framework with Gradient Boosted Trees to optimize for multiple objectives.

With AL, we introduce dual variables into Boosting. The dual variables are iteratively optimized and updated within the Boosting iterations. The Boosting objective function is updated with the AL terms, and its gradient is readily derived from the GBT gradients. With this gradient and the dual-variable updates, we solve the optimization problem by jointly iterating AL and Boosting steps.

This library is motivated by the work in the paper Multi-objective Relevance Ranking [2], which introduces an Augmented Lagrangian based method to incorporate multiple objectives (MO) in LambdaMART, a GBT-based search-ranking algorithm.

We have modified the scikit-learn GBT implementation [3] to support multi-objective optimization.

Highlights -

  • follows the scikit-learn API conventions
  • supports all hyperparameters present in scikit-learn GBT
  • supports optimization for more than one sub-objective

Current support -

  • MooGBTClassifier - "binomial deviance" loss function, for primary and sub-objectives represented as binary variables
  • MooGBTRegressor - "least squares" loss function, for primary and sub-objectives represented as continuous variables

    Installation

    Moo-GBT can be installed from PyPI

    pip3 install moo-gbt

    Usage

    from multiobjective_gbt import MooGBTClassifier
    
    mu = 100
    b = 0.7 # upper bound on sub-objective cost
    
    constrained_gbt = MooGBTClassifier(
    				loss='deviance',
    				n_estimators=100,
    				constraints=[{"mu":mu, "b":b}], # One Constraint
    				random_state=2021
    )
    constrained_gbt.fit(X_train, y_train)

    Here y_train contains two columns: the first column must be the primary objective, and the following columns are the sub-objectives for which constraints have been specified (in the same order).


    Usage Steps

    1. Run unconstrained GBT on the primary objective. Unconstrained GBT is simply scikit-learn's GradientBoostingClassifier/GradientBoostingRegressor.
    2. Calculate the loss function value for the primary objective and sub-objective(s) (a sketch follows this list):
      • For MooGBTClassifier, calculate the log loss between the predicted probabilities and the sub-objective label(s)
      • For MooGBTRegressor, calculate the mean squared error between the predicted values and the sub-objective label(s)
    3. Set the value of the hyperparameter b to less than the cost calculated in the previous step, and run MooGBTClassifier/MooGBTRegressor with this b. The lower the value of b, the more the sub-objective will be optimized.
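
    As an illustration of step 2, here is a minimal sketch using scikit-learn metrics; the model and label names (unconstrained_clf, unconstrained_reg, y_sub) are placeholders, not part of this library:

    from sklearn.metrics import log_loss, mean_squared_error

    # classifier case: log loss between predicted probabilities and a sub-objective label
    so_cost = log_loss(y_sub, unconstrained_clf.predict_proba(X_train)[:, 1])

    # regressor case: mean squared error between predictions and a sub-objective target
    so_cost = mean_squared_error(y_sub, unconstrained_reg.predict(X_train))

    # pick an upper bound below the measured cost, e.g. 10% tighter
    b = 0.9 * so_cost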

    Example with multiple binary objectives

    import pandas as pd
    import numpy as np
    import seaborn as sns
    import matplotlib.pyplot as plt
    
    from sklearn.model_selection import train_test_split
    from multiobjective_gbt import MooGBTClassifier

    We'll use a publicly available dataset, the Expedia hotel recommendations dataset.

    We define a multi-objective problem on the dataset, with the primary objective as the column "is_booking" and sub-objective as the column "is_package". Both these variables are binary.

    # Preprocessing Data
    train_data = pd.read_csv('examples/expedia-data/expedia-hotel-recommendations/train_data_sample.csv')
    
    po = 'is_booking' # primary objective
    so = 'is_package' # sub-objective
    
    features = list(train_data.columns)
    features.remove(po)
    outcome_flag = po
    
    # Train-Test Split
    X_train, X_test, y_train, y_test = train_test_split(
    					train_data[features],
    					train_data[outcome_flag],
    					test_size=0.2,
    					stratify=train_data[[po, so]],
    					random_state=2021
    )
    
    # Creating y_train_, y_test_ with 2 labels
    y_train_ = pd.DataFrame()
    y_train_[po] = y_train
    y_train_[so] = X_train[so]
    
    y_test_ = pd.DataFrame()
    y_test_[po] = y_test
    y_test_[so] = X_test[so]

    Without the constraints parameter, MooGBTClassifier works as the standard scikit-learn GBT classifier.

    unconstrained_gbt = MooGBTClassifier(
    				loss='deviance',
    				n_estimators=100,
    				random_state=2021
    )
    
    unconstrained_gbt.fit(X_train, y_train)

    Get train and test sub-objective costs for unconstrained model.

    def get_binomial_deviance_cost(pred, y):
    	# clip predicted probabilities to avoid log(0)
    	pred = np.clip(pred, 1e-15, 1 - 1e-15)
    	return -np.mean(y * np.log(pred) + (1-y) * np.log(1-pred))
    
    pred_train = unconstrained_gbt.predict_proba(X_train)[:,1]
    pred_test = unconstrained_gbt.predict_proba(X_test)[:,1]
    
    # get sub-objective costs
    so_train_cost = get_binomial_deviance_cost(pred_train, X_train[so])
    so_test_cost = get_binomial_deviance_cost(pred_test, X_test[so])
    
    print(f"""
    Sub-objective cost train - {so_train_cost},
    Sub-objective cost test  - {so_test_cost}
    """)
    Sub-objective cost train - 0.9114,
    Sub-objective cost test  - 0.9145
    

    The constraint is specified as an upper bound on the sub-objective cost. In the unconstrained model, the cost of our sub-objective is ~0.9, so setting an upper bound below 0.9 will optimize the sub-objective.

    b = 0.65 # upper bound on cost
    mu = 100
    constrained_gbt = MooGBTClassifier(
    				loss='deviance',
    				n_estimators=100,
    				constraints=[{"mu":mu, "b":b}], # One Constraint
    				random_state=2021
    )
    
    constrained_gbt.fit(X_train, y_train_)

    From the constrained model, we achieve more than 100% gain in AUROC for the sub-objective, while the loss in primary-objective AUROC is kept within 6%. The entire study on this dataset can be found in the example notebook.
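
    A minimal sketch of how such AUROC comparisons can be computed, assuming MooGBTClassifier exposes the scikit-learn predict_proba API for the primary model (the %gain convention matches the tables below):

    from sklearn.metrics import roc_auc_score

    # predicted probabilities from both models on the test set
    pred_unc = unconstrained_gbt.predict_proba(X_test)[:, 1]
    pred_con = constrained_gbt.predict_proba(X_test)[:, 1]

    for name, y_col in [("primary", po), ("sub-objective", so)]:
        auc_unc = roc_auc_score(y_test_[y_col], pred_unc)
        auc_con = roc_auc_score(y_test_[y_col], pred_con)
        gain = 100 * (auc_con - auc_unc) / auc_unc  # %gain over unconstrained GBT
        print(f"{name}: {gain:.2f}% AUROC gain")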

    Looking at MooGBT primary and sub-objective losses -

    To get raw values of loss functions wrt boosting iteration,

    # return a Pandas dataframe with loss values of objectives wrt boosting iteration
    losses = constrained_gbt.loss_.get_losses()
    losses.head()

    Similarly, you can also look at the dual variable (alpha) values for the sub-objective(s).

    To get raw alpha values wrt boosting iteration,

    constrained_gbt.loss_.get_alphas()

    These losses can be used to inspect the MooGBT learning process.

    sns.lineplot(data=losses, x='n_estimators', y='primary_objective', label='primary objective')
    sns.lineplot(data=losses, x='n_estimators', y='sub_objective_1', label='subobjective')
    
    plt.xlabel("# estimators(trees)")
    plt.ylabel("Cost")
    plt.legend(loc = "upper right")


    Choosing the right upper bound constraint b and mu value

    The upper bound should be chosen based on an acceptable % loss in the primary-objective evaluation metric. For stricter (lower) upper bounds, this loss will be greater, as MooGBT will optimize the sub-objective more.
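
    One way to pick b in practice is to sweep candidate upper bounds and record the trade-off. A sketch under the setup above (the candidate b values and fixed mu=100 are illustrative choices):

    from sklearn.metrics import roc_auc_score

    results = []
    for b in [0.9, 0.8, 0.7, 0.65, 0.6]:
        gbt = MooGBTClassifier(
            loss='deviance',
            n_estimators=100,
            constraints=[{"mu": 100, "b": b}],
            random_state=2021,
        )
        gbt.fit(X_train, y_train_)
        pred = gbt.predict_proba(X_test)[:, 1]
        results.append({
            "b": b,
            "primary_auroc": roc_auc_score(y_test_[po], pred),
            "sub_auroc": roc_auc_score(y_test_[so], pred),
        })
    print(pd.DataFrame(results))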

    The table below summarizes the effect of the upper bound value on model performance for the primary and sub-objective(s) in the above example.

    %gain specifies the percentage increase in AUROC for the constrained MooGBT model over an unconstrained GBT model.

    b      Primary Objective %gain    Sub-objective %gain
    0.9    -0.7058                      4.805
    0.8    -1.735                      40.08
    0.7    -2.7852                     62.7144
    0.65   -5.8242                    113.9427
    0.6    -9.9137                    159.8931

    In general, across our experiments we have found that lower values of mu optimize the primary objective better while still satisfying the sub-objective constraints, given enough boosting iterations (n_estimators).

    The table below summarizes the results of varying mu while keeping the upper bound fixed (b=0.6).

    b     mu     Primary Objective %gain    Sub-objective %gain
    0.6   1000   -20.6569                   238.1388
    0.6   100    -13.3769                   197.8186
    0.6   10      -9.9137                   159.8931
    0.6   5       -8.643                    146.4171

    MooGBT Learning Process

    MooGBT optimizes for multiple objectives by defining constraints on sub-objective(s) along with a primary objective. The constraints are defined as upper bounds on sub-objective loss function.

    MooGBT differs from a standard GBT in the loss function it optimizes: it optimizes the primary objective C1 together with the sub-objective constraints, using the Augmented Lagrangian (AL) constrained-optimization approach.
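
    In standard form, the constrained problem and its Augmented Lagrangian can be sketched as follows (the exact dual-regularization term is an assumption, chosen to be consistent with the alpha update described below):

    \min_{s}\; C_1(s) \quad \text{s.t.} \quad C_t(s) \le b_t, \qquad t = 2, \dots, M

    \mathcal{L}(s, \alpha) = C_1(s) + \sum_{t=2}^{M} \alpha_t \bigl(C_t(s) - b_t\bigr) - \frac{1}{2\mu} \sum_{t=2}^{M} \alpha_t^2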

    where α = [α2, α3, ..., αM] is a vector of dual variables, one per sub-objective. The Lagrangian is solved by minimizing with respect to the primal variables s and maximizing with respect to the dual variables α. The Augmented Lagrangian solves the constrained optimization iteratively; since AL is an iterative approach, we integrate it with the boosting iterations of GBT to update the dual variables α.

    Alpha (α) update -

    At iteration k, if constraint t is not satisfied, i.e., C_t(s) > b_t, we have α_t^k > α_t^{k-1}. Otherwise, if the constraint is met, the dual variable α_t is set to 0.
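
    With the α²/(2μ) regularization assumed above, maximizing the Lagrangian over α_t ≥ 0 gives a closed-form update consistent with this behavior (a sketch, not necessarily the library's exact rule):

    \alpha_t^{k} = \mu \, \max\bigl(0,\; C_t(s^{k}) - b_t\bigr)

    Larger μ thus reacts more strongly to constraint violations, which matches the mu table above: larger μ yields bigger sub-objective gains at a larger cost to the primary objective.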

    Public contents

    • _gb.py: contains the MooGBTClassifier and MooGBTRegressor classes, including the implementation of the fit and predict functions. Extended from scikit-learn's _gb.py.

    • _gb_losses.py: contains the BinomialDeviance and LeastSquares loss-function classes. Extended from scikit-learn's _gb_losses.py.

    More examples

    The examples directory contains several illustrations of how one can use this library.

    References - 

    [1] Multi-objective Ranking via Constrained Optimization - https://arxiv.org/pdf/2002.05753.pdf
    [2] Multi-objective Relevance Ranking - https://sigir-ecom.github.io/ecom2019/ecom19Papers/paper30.pdf
    [3] Scikit-learn GBT implementation - GradientBoostingClassifier and GradientBoostingRegressor
