UpliftML: A Python Package for Scalable Uplift Modeling

Overview

UpliftML is a Python package for scalable unconstrained and constrained uplift modeling from experimental data. To accommodate working with big data, the package uses PySpark and H2O models as base learners for the uplift models. Evaluation functions expect a PySpark dataframe as input.

Uplift modeling is a family of techniques for estimating the Conditional Average Treatment Effect (CATE) from experimental or observational data using machine learning. In particular, we are interested in estimating the causal effect of a treatment T on the outcome Y of an individual characterized by features X. In experimental data with binary treatments and binary outcomes, this is equivalent to estimating Pr(Y=1 | T=1, X=x) - Pr(Y=1 | T=0, X=x).
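
To make the formula above concrete, here is a minimal sketch in plain Python (not UpliftML code) that estimates Pr(Y=1 | T=1, X=x) - Pr(Y=1 | T=0, X=x) from raw frequencies when X is a single discrete segment. The function name `empirical_uplift` and the toy records are hypothetical and serve only as illustration; real uplift models replace the frequency counts with fitted base learners.

```python
from collections import defaultdict


def empirical_uplift(records):
    """Estimate Pr(Y=1 | T=1, X=x) - Pr(Y=1 | T=0, X=x) for each discrete x.

    `records` is an iterable of (x, t, y) tuples with t and y in {0, 1}.
    Segments missing either a treated or a control observation are skipped.
    """
    # counts[x][t] = [number of observations, number of positive outcomes]
    counts = defaultdict(lambda: {0: [0, 0], 1: [0, 0]})
    for x, t, y in records:
        counts[x][t][0] += 1
        counts[x][t][1] += y

    uplift = {}
    for x, by_arm in counts.items():
        (n_c, pos_c), (n_t, pos_t) = by_arm[0], by_arm[1]
        if n_c == 0 or n_t == 0:
            continue
        uplift[x] = pos_t / n_t - pos_c / n_c
    return uplift


# Hypothetical toy data: (segment, treatment, outcome)
toy_records = [
    ("new", 1, 1), ("new", 1, 1), ("new", 1, 0), ("new", 1, 0),
    ("new", 0, 1), ("new", 0, 0), ("new", 0, 0), ("new", 0, 0),
    ("loyal", 1, 1), ("loyal", 1, 1), ("loyal", 0, 1), ("loyal", 0, 1),
]
segment_uplift = empirical_uplift(toy_records)
```

In this toy data the treatment raises the conversion rate of the "new" segment from 1/4 to 2/4, while "loyal" users convert regardless of treatment, so only the first segment shows a positive uplift.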

In many practical use cases the goal is to select which users to target in order to maximize the overall uplift without exceeding a specified budget or ROI constraint. In those cases, estimating uplift alone is not sufficient to make optimal decisions and we need to take into account the costs and monetary benefit incurred by the treatment.
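
As a rough illustration of why cost matters, the sketch below ranks users by predicted uplift per unit of cost and targets them greedily until a budget is exhausted. This is a simple knapsack-style heuristic chosen for illustration, not the constrained method implemented in UpliftML; the function name, the tuple layout, and the numbers are all assumptions.

```python
def select_within_budget(candidates, budget):
    """Greedily pick users by predicted uplift per unit cost within a budget.

    `candidates` is a list of (user_id, predicted_uplift, cost) tuples with
    cost > 0. This is a heuristic, not an exact knapsack solver.
    """
    ranked = sorted(candidates, key=lambda c: c[1] / c[2], reverse=True)
    selected, spent = [], 0.0
    for user_id, uplift, cost in ranked:
        if uplift <= 0:
            # Costs are positive, so every remaining user also has
            # non-positive predicted uplift.
            break
        if spent + cost <= budget:
            selected.append(user_id)
            spent += cost
    return selected, spent


# Hypothetical predictions: (user_id, predicted_uplift, cost)
toy_candidates = [
    ("a", 0.10, 2.0),
    ("b", 0.08, 1.0),
    ("c", 0.02, 1.0),
    ("d", -0.01, 1.0),
]
targeted, total_cost = select_within_budget(toy_candidates, budget=3.0)
```

Note that user "a" has the highest predicted uplift but is ranked after "b" because it costs twice as much, which is exactly the kind of trade-off that uplift estimates alone do not capture.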

Uplift modeling is an emerging tool for various personalization applications. Example use cases include marketing campaign personalization and optimization, personalized pricing in e-commerce, and clinical treatment personalization.

The UpliftML library includes PySpark/H2O implementations for the following:

  • 6 metalearner approaches for uplift modeling: T-learner[1], S-learner[1], X-learner[1], R-learner[2], class variable transformation[3], transformed outcome approach[4].
  • The Retrospective Estimation[5] technique for uplift modeling under ROI constraints.
  • Uplift and iROI-based evaluation and plotting functions with bootstrapped confidence intervals. Currently implemented: ATE, ROI, iROI, CATE per category/quantile, CATE lift, Qini/AUUC curves[6], Qini/AUUC score[6], cumulative iROI curves.
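
Two of the approaches listed above, the class variable transformation[3] and the transformed outcome approach[4], work by relabeling the data so that a single standard model can estimate uplift. The sketch below shows the textbook form of both relabelings in plain Python; it is not taken from the UpliftML source, and the function names are hypothetical. It assumes a randomized experiment with a known treatment probability `p_treatment`.

```python
def class_variable(y, t):
    """Class variable transformation for binary y and t.

    Z = 1 when a treated unit responds or a control unit does not.
    With a 50/50 treatment split, uplift(x) = 2 * Pr(Z=1 | X=x) - 1.
    """
    return y * t + (1 - y) * (1 - t)


def transformed_outcome(y, t, p_treatment):
    """Transformed outcome Y* = Y * (T - p) / (p * (1 - p)).

    Under randomization with Pr(T=1) = p, E[Y* | X=x] equals the CATE,
    so a regression of Y* on X estimates uplift directly.
    """
    return y * (t - p_treatment) / (p_treatment * (1.0 - p_treatment))


# Hypothetical balanced experiment: (treatment, outcome)
toy_pairs = [(1, 1), (1, 1), (1, 0), (1, 0), (0, 1), (0, 0), (0, 0), (0, 0)]
y_star = [transformed_outcome(y, t, p_treatment=0.5) for t, y in toy_pairs]
mean_y_star = sum(y_star) / len(y_star)
z_labels = [class_variable(y, t) for t, y in toy_pairs]
```

On this toy data the mean of the transformed outcome matches the difference in conversion rates between the treated group (2/4) and the control group (1/4).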

For detailed information about the package, read the UpliftML documentation.

Installation

Install the latest release from PyPI:

$ pip install upliftml

Quick Start

from upliftml.models.pyspark import TLearnerEstimator
from upliftml.evaluation import estimate_and_plot_qini
from upliftml.datasets import simulate_randomized_trial
from pyspark.ml.classification import LogisticRegression
from pyspark.sql import SparkSession


# Reuse the active Spark session, or create one if none exists
spark = SparkSession.builder.getOrCreate()

# Read/generate the dataset and convert it to Spark if needed
df_pd = simulate_randomized_trial(n=2000, p=6, sigma=1.0, binary_outcome=True)
df_spark = spark.createDataFrame(df_pd)

# Split the data into train, validation, and test sets
df_train, df_val, df_test = df_spark.randomSplit([0.5, 0.25, 0.25])

# Preprocess the datasets (for implementation of get_features_vector, see the full example notebook)
num_features = [col for col in df_spark.columns if col.startswith('feature')]
cat_features = []
df_train_assembled = get_features_vector(df_train, num_features, cat_features)
df_val_assembled = get_features_vector(df_val, num_features, cat_features)
df_test_assembled = get_features_vector(df_test, num_features, cat_features)

# Build a two-model estimator
model = TLearnerEstimator(base_model_class=LogisticRegression,
                          base_model_params={'maxIter': 15},
                          predictors_colname='features',
                          target_colname='outcome',
                          treatment_colname='treatment',
                          treatment_value=1,
                          control_value=0)
model.fit(df_train_assembled, df_val_assembled)

# Apply the model to test data
df_test_eval = model.predict(df_test_assembled)

# Evaluate performance on the test set
qini_values, ax = estimate_and_plot_qini(df_test_eval)

For complete examples with more estimators and evaluation functions, see the demo notebooks in the examples folder.
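
For intuition about what the Qini evaluation in the Quick Start measures, the following plain-Python sketch computes one common form of the Qini curve[6]: users are sorted by predicted uplift, and at each cutoff the cumulative number of treated responders is compared with the control responders scaled to the treated group size. Conventions differ between papers and libraries, so this should be read as an illustrative assumption rather than a description of what `estimate_and_plot_qini` does internally; the function name `qini_curve` and the toy rows are hypothetical.

```python
def qini_curve(rows):
    """Return cumulative Qini values after sorting by predicted uplift.

    `rows` is a list of (predicted_uplift, treatment, outcome) tuples with
    binary treatment and outcome. Rows are ranked from highest to lowest
    predicted uplift.
    """
    ranked = sorted(rows, key=lambda r: r[0], reverse=True)
    n_t = n_c = y_t = y_c = 0
    curve = []
    for _, t, y in ranked:
        if t == 1:
            n_t += 1
            y_t += y
        else:
            n_c += 1
            y_c += y
        # Scale control responders to the treated group size seen so far
        scaled_control = y_c * n_t / n_c if n_c else 0.0
        curve.append(y_t - scaled_control)
    return curve


# Hypothetical scored test set: (predicted_uplift, treatment, outcome)
toy_rows = [
    (0.9, 1, 1), (0.8, 0, 0), (0.7, 1, 1),
    (0.6, 0, 1), (0.2, 1, 0), (0.1, 0, 0),
]
toy_qini = qini_curve(toy_rows)
```

A model that ranks persuadable users first produces a curve that rises quickly above its final value, as the toy example does at the third position; a random ranking would tend to approach the final value roughly linearly.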

Contributing

If you are interested in contributing to the package, get started by reading our contributor guidelines.

License

The project is licensed under the Apache 2.0 License.

Citation

If you use UpliftML, please cite it as follows:

Irene Teinemaa, Javier Albert, Nam Pham. UpliftML: A Python Package for Scalable Uplift Modeling. https://github.com/bookingcom/upliftml, 2021. Version 0.0.1.

@misc{upliftml,
  author={Irene Teinemaa and Javier Albert and Nam Pham},
  title={{UpliftML}: {A Python Package for Scalable Uplift Modeling}},
  howpublished={https://github.com/bookingcom/upliftml},
  note={Version 0.0.1},
  year={2021}
}

Resources

Related packages:

  • CausalML: a Python package for uplift modeling and causal inference with machine learning
  • EconML: a Python package for estimating heterogeneous treatment effects from observational data via machine learning

References

  1. Sören R. Künzel, Jasjeet S. Sekhon, Peter J. Bickel, and Bin Yu. Metalearners for estimating heterogeneous treatment effects using machine learning. Proceedings of the National Academy of Sciences, 2019.
  2. Xinkun Nie and Stefan Wager. Quasi-oracle estimation of heterogeneous treatment effects. arXiv preprint arXiv:1712.04912, 2017.
  3. Maciej Jaskowski and Szymon Jaroszewicz. Uplift modeling for clinical trial data. ICML Workshop on Clinical Data Analysis, 2012.
  4. Susan Athey and Guido W. Imbens. Machine learning methods for estimating heterogeneous causal effects. stat, 1050(5), 2015.
  5. Dmitri Goldenberg, Javier Albert, Lucas Bernardi, Pablo Estevez Castillo. Free Lunch! Retrospective Uplift Modeling for Dynamic Promotions Recommendation within ROI Constraints. In Fourteenth ACM Conference on Recommender Systems (pp. 486-491), 2020.
  6. Nicholas J Radcliffe and Patrick D Surry. Real-world uplift modelling with significance based uplift trees. White Paper tr-2011-1, Stochastic Solutions, 2011.