pure-predict: Machine learning prediction in pure Python

pure-predict speeds up and slims down machine learning prediction applications. It is a foundational tool for serverless inference or small batch prediction with popular machine learning frameworks like scikit-learn and fasttext. It implements the predict methods of these frameworks in pure Python.

Primary Use Cases

The primary use case for pure-predict is the following scenario:

  1. A model is trained in an environment without strong container footprint constraints, for example a long-running "offline" job on one or many machines where installing a number of Python packages from PyPI is not at all problematic.
  2. At prediction time the model needs to be served behind an API. Typical access patterns are to request a prediction for one "record" (one "row" in a numpy array or one string of text to classify) per request or a mini-batch of records per request.
  3. Preferred infrastructure for the prediction service is either serverless (AWS Lambda) or a container service where the memory footprint of the container is constrained.
  4. The fitted model object's artifacts needed for prediction (coefficients, weights, vocabulary, decision tree artifacts, etc.) are relatively small (tens to hundreds of MB).

In this scenario, a container service with a large dependency footprint can be overkill for a microservice, particularly if the access patterns favor the pricing model of a serverless application. Additionally, for smaller models and single-record predictions per request, the numpy and scipy functionality in the prediction methods of popular machine learning frameworks works against the application in terms of latency, and in some cases underperforms pure Python.

Check out the blog post for more information on the motivation and use cases of pure-predict.

Package Details

pure-predict is a Python package for machine learning prediction distributed under the Apache 2.0 software license. It contains multiple subpackages, each of which mirrors its open source counterpart (scikit-learn, fasttext, etc.). Each subpackage has utilities to convert a fitted machine learning model into a custom object whose prediction methods mirror their native counterparts but are implemented in pure Python. Additionally, all relevant model artifacts needed for prediction are converted to pure Python.
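To make the idea of a converted model concrete, here is a minimal sketch of what a pure-Python linear classifier could look like. The class name `PureLinearClassifier` is hypothetical and is not the object that pure_sklearn produces; the sketch also assumes one weight vector per class, which is a simplification.

```python
class PureLinearClassifier:
    """Hypothetical stand-in for a converted linear model (not a pure_sklearn class)."""

    def __init__(self, coef, intercept, classes):
        # All artifacts are plain Python lists, so numpy is not needed to predict.
        self.coef_ = coef            # assumed: one weight vector per class
        self.intercept_ = intercept  # assumed: one bias term per class
        self.classes_ = classes

    def decision_function(self, X):
        # Score every row against every class with an explicit dot product.
        return [
            [sum(w * x for w, x in zip(weights, row)) + b
             for weights, b in zip(self.coef_, self.intercept_)]
            for row in X
        ]

    def predict(self, X):
        # Pick the class with the highest score for each row.
        scores = self.decision_function(X)
        return [self.classes_[row.index(max(row))] for row in scores]


clf = PureLinearClassifier(
    coef=[[1.0, -1.0], [-1.0, 1.0]],
    intercept=[0.0, 0.5],
    classes=["a", "b"],
)
print(clf.predict([[2.0, 0.0], [0.0, 2.0]]))  # ['a', 'b']
```

Because every attribute is a built-in type, an object like this has no import-time dependency on the training framework.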

A pure-predict model object can then be pickled and later unpickled without any 3rd party dependencies other than pure-predict.

This eliminates the need to have large dependency packages installed in order to make predictions with fitted machine learning models using popular open source packages for training models. These dependencies (numpy, scipy, scikit-learn, fasttext, etc.) are large in size and not always necessary to make fast and accurate predictions. Additionally, they rely on C extensions that may not be ideal for serverless applications with a Python runtime.

Quick Start Example

In a Python environment with scikit-learn and its dependencies installed:

import pickle

from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_iris
from pure_sklearn.map import convert_estimator

# fit sklearn estimator
X, y = load_iris(return_X_y=True)
clf = RandomForestClassifier()
clf.fit(X, y)

# convert to pure python estimator
clf_pure_predict = convert_estimator(clf)
with open("model.pkl", "wb") as f:
    pickle.dump(clf_pure_predict, f)

# make prediction with sklearn estimator
y_pred = clf.predict([[0.25, 2.0, 8.3, 1.0]])
print(y_pred)  # prints [2]

In a Python environment with only pure-predict installed:

import pickle

# load pickled model
with open("model.pkl", "rb") as f:
    clf = pickle.load(f)

# make prediction with pure-predict object
y_pred = clf.predict([[0.25, 2.0, 8.3, 1.0]])
print(y_pred)  # prints [2]
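For the serverless scenario described under Primary Use Cases, the loading and prediction steps could be wrapped in an AWS Lambda-style handler. The following is only a sketch: the `load_model` and `make_handler` helpers and the `{"records": [...]}` request shape are assumptions made for illustration and are not part of pure-predict.

```python
import json
import pickle


def load_model(path="model.pkl"):
    # Intended to run once per cold start rather than on every request.
    with open(path, "rb") as f:
        return pickle.load(f)


def make_handler(model):
    """Build a Lambda-style handler around any object with a predict method."""

    def handler(event, context):
        # Hypothetical request shape: {"body": "{\"records\": [[...], ...]}"}
        records = json.loads(event["body"])["records"]
        predictions = model.predict(records)
        return {
            "statusCode": 200,
            "body": json.dumps({"predictions": list(predictions)}),
        }

    return handler


# In a deployed function this would typically sit at module scope:
# handler = make_handler(load_model("model.pkl"))
```

Keeping the model outside the handler body means a mini-batch of records or a single record per request both reuse the same unpickled object.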

Subpackages

pure_sklearn

Prediction in pure python for a subset of scikit-learn estimators and transformers.

  • estimators
    • linear models - supports the majority of linear models for classification
    • trees - decision trees, random forests, gradient boosting and xgboost
    • naive bayes - a number of popular naive bayes classifiers
    • svm - linear SVC
  • transformers
    • preprocessing - normalization and onehot/ordinal encoders
    • impute - simple imputation
    • feature extraction - text (tfidf, count vectorizer, hashing vectorizer) and dictionary vectorization
    • pipeline - pipelines and feature unions
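As an illustration of how tree prediction can work without numpy, the sketch below walks a single decision tree stored as parallel Python lists. The field names follow scikit-learn's `tree_` arrays (`children_left`, `children_right`, `feature`, `threshold`, `value`), but the `predict_tree` function and the flattened `value` layout are assumptions for this example, not pure_sklearn's implementation.

```python
def predict_tree(tree, row):
    """Walk one decision tree stored as parallel lists; return the leaf's class index."""
    node = 0
    # A child index of -1 marks a leaf, as in scikit-learn's tree_ arrays.
    while tree["children_left"][node] != -1:
        if row[tree["feature"][node]] <= tree["threshold"][node]:
            node = tree["children_left"][node]
        else:
            node = tree["children_right"][node]
    counts = tree["value"][node]
    return counts.index(max(counts))


# Hand-written toy tree: the root splits on feature 0 at 2.5.
toy_tree = {
    "children_left": [1, -1, -1],
    "children_right": [2, -1, -1],
    "feature": [0, -2, -2],
    "threshold": [2.5, -2.0, -2.0],
    "value": [[5, 5], [5, 0], [0, 5]],  # simplified: per-node class counts
}

print([predict_tree(toy_tree, r) for r in [[1.0], [4.0]]])  # [0, 1]
```

A forest could then be handled by running the same walk over each tree and combining the results.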

Sparse data - a custom pure Python sparse data object is supported, and the relevant transformers and estimators handle sparse data as would be expected.
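One simple way to represent sparse rows in pure Python is a dictionary that maps column index to value. The sketch below uses that representation; it is an assumption for illustration and does not describe the actual sparse object in pure_sklearn. The helper names `to_sparse_row` and `sparse_dot` are hypothetical.

```python
def to_sparse_row(dense_row):
    """Keep only non-zero entries as {column_index: value}."""
    return {j: v for j, v in enumerate(dense_row) if v != 0}


def sparse_dot(sparse_row, weights):
    """Dot product that touches only the stored (non-zero) entries."""
    return sum(v * weights[j] for j, v in sparse_row.items())


row = to_sparse_row([0, 0, 3.0, 0, 1.5])
print(row)                                          # {2: 3.0, 4: 1.5}
print(sparse_dot(row, [0.1, 0.2, 0.5, 0.9, 2.0]))   # 4.5
```

For text features such as tf-idf vectors, where most entries are zero, this avoids iterating over the full vocabulary for each record.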

pure_fasttext

Prediction in pure python for fasttext.

  • supervised - predicts labels for supervised models; no support for quantized models (blocked by this issue)
  • unsupervised - lookup of word or sentence embeddings given input text

Installation

Dependencies

pure-predict requires:

Dependency Notes

  • pure_sklearn has been tested with scikit-learn versions >= 0.20 -- certain functionality may work with lower versions but is not guaranteed. Some functionality is explicitly not supported for certain scikit-learn versions and exceptions will be raised as appropriate.
  • xgboost requires version >= 0.82 for support with pure_sklearn.
  • pure-predict is not supported with Python 2.
  • fasttext versions <= 0.9.1 have been tested.

User Installation

The easiest way to install pure-predict is with pip:

pip install --upgrade pure-predict

You can also download the source code:

git clone https://github.com/Ibotta/pure-predict.git

Testing

With pytest installed, you can run tests locally:

pytest pure-predict

Examples

The package contains examples on how to use pure-predict in practice.

Calls for Contributors

Contributions to pure-predict are welcome. Specific calls for contribution are as follows:

  1. Examples, tests and documentation -- particularly more detailed examples with performance testing of various estimators under various constraints.
  2. Adding more pure_sklearn estimators. The scikit-learn package is extensive and only partially covered by pure_sklearn. Regression tasks in particular are missing from pure_sklearn. Clustering, dimensionality reduction, nearest neighbors, feature selection, non-linear SVM, and more are also omitted and would be good candidates for extending pure_sklearn.
  3. General efficiency. There is likely low hanging fruit for improving the efficiency of the numpy and scipy functionality that has been ported to pure-predict.
  4. Threading could be considered to improve performance -- particularly for making predictions with multiple records.
  5. A public AWS lambda layer containing pure-predict.

Background

The project was started at Ibotta Inc. on the machine learning team and open sourced in 2020. It is currently maintained by the machine learning team at Ibotta.

Acknowledgements

Thanks to David Mitchell and Andrew Tilley for internal review before open source. Thanks to James Foley for logo artwork.
