naked is a Python tool which allows you to strip a model and only keep what matters for making predictions.

Related tags

Deep Learningnaked
Overview

naked

naked is a Python tool which allows you to strip a model and only keep what matters for making predictions. The result is a pure Python function with no third-party dependencies that you can simply copy/paste wherever you wish.

This is simpler than deploying an API endpoint or loading a serialized model. The jury is still out on whether this is sane or not. Of course I'm not the first one to have done this, for instance see sklearn-porter.

Installation

pip install git+https://github.com/MaxHalford/naked

Examples

sklearn.linear_model.LinearRegression

First, we fit a model.

import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[1, 1], [1, 2], [2, 2], [2, 3]])
y = np.dot(X, np.array([1, 2])) + 3
lin_reg = LinearRegression().fit(X, y)
lin_reg.fit(X, y)

Then, we strip it.

import naked

print(naked.strip(lin_reg))

Which produces the following output.

def linear_regression(x):

    coef_ = [1.0000000000000002, 1.9999999999999991]
    intercept_ = 3.0000000000000018

    return intercept_ + sum(xi * wi for xi, wi in enumerate(coef_))

sklearn.pipeline.Pipeline

import naked
from sklearn import linear_model
from sklearn import feature_extraction
from sklearn import pipeline
from sklearn import preprocessing

model = pipeline.make_pipeline(
    feature_extraction.text.TfidfVectorizer(),
    preprocessing.Normalizer(),
    linear_model.LogisticRegression(solver='liblinear')
)

docs = ['Sad', 'Angry', 'Happy', 'Joyful']
is_positive = [False, False, True, True]

model.fit(docs, is_positive)

print(naked.strip(model))

This produces the following output.

def tfidf_vectorizer(x):

    lowercase = True
    norm = 'l2'
    vocabulary_ = {'sad': 3, 'angry': 0, 'happy': 1, 'joyful': 2}
    idf_ = [1.916290731874155, 1.916290731874155, 1.916290731874155, 1.916290731874155]

    import re

    if lowercase:
        x = x.lower()

    # Tokenize
    x = re.findall(r"(?u)\b\w\w+\b", x)
    x = [xi for xi in x if len(xi) > 1]

    # Count term frequencies
    from collections import Counter
    tf = Counter(x)
    total = sum(tf.values())

    # Compute the TF-IDF of each tokenized term
    tfidf = [0] * len(vocabulary_)
    for term, freq in tf.items():
        try:
            index = vocabulary_[term]
        except KeyError:
            continue
        tfidf[index] = freq * idf_[index] / total

    # Apply normalization
    if norm == 'l2':
        norm_val = sum(xi ** 2 for xi in tfidf) ** .5

    return [v / norm_val for v in tfidf]

def normalizer(x):

    norm = 'l2'

    if norm == 'l2':
        norm_val = sum(xi ** 2 for xi in x) ** .5
    elif norm == 'l1':
        norm_val = sum(abs(xi) for xi in x)
    elif norm == 'max':
        norm_val = max(abs(xi) for xi in x)

    return [xi / norm_val for xi in x]

def logistic_regression(x):

    coef_ = [[-0.40105811611957726, 0.40105811611957726, 0.40105811611957726, -0.40105811611957726]]
    intercept_ = [0.0]

    import math

    logits = [
        b + sum(xi * wi for xi, wi in zip(x, w))
        for w, b in zip(coef_, intercept_)
    ]

    # Sigmoid activation for binary classification
    if len(logits) == 1:
        p_true = 1 / (1 + math.exp(-logits[0]))
        return [1 - p_true, p_true]

    # Softmax activation for multi-class classification
    z_max = max(logits)
    exp = [math.exp(z - z_max) for z in logits]
    exp_sum = sum(exp)
    return [e / exp_sum for e in exp]

def pipeline(x):
    x = tfidf_vectorizer(x)
    x = normalizer(x)
    x = logistic_regression(x)
    return x

FAQ

What models are supported?

>>> import naked
>>> print(naked.AVAILABLE)
sklearn
    LinearRegression
    LogisticRegression
    Normalizer
    StandardScaler
    TfidfVectorizer

Will this work for all library versions?

Not by design. A release of naked is intended to support a library above a particular version. If we notice that naked doesn't work for a newer version of a given library, then a new version of naked should be released to handle said library version. You may refer to the pyproject.toml file to view library support.

How can I trust this is correct?

This package is really easy to unit test. One simply has to compare the outputs of the model with its "naked" version and check that the outputs are identical. Check out the test_naked.py file if you're curious.

How should I handle feature names?

Let's take the example of a multi-class logistic regression trained on the wine dataset.

from sklearn import datasets
from sklearn import linear_model
from sklearn import pipeline
from sklearn import preprocessing

dataset = datasets.load_wine()
X = dataset.data
y = dataset.target
model = pipeline.make_pipeline(
    preprocessing.StandardScaler(),
    linear_model.LogisticRegression()
)
model.fit(X, y)

By default, the strip function produces a function that takes as input a list of feature values. Instead, let's say we want to evaluate the function on a dictionary of features, thus associating each feature value with a name.

x = dict(zip(dataset.feature_names, X[0]))
print(x)
{'alcohol': 14.23,
 'malic_acid': 1.71,
 'ash': 2.43,
 'alcalinity_of_ash': 15.6,
 'magnesium': 127.0,
 'total_phenols': 2.8,
 'flavanoids': 3.06,
 'nonflavanoid_phenols': 0.28,
 'proanthocyanins': 2.29,
 'color_intensity': 5.64,
 'hue': 1.04,
 'od280/od315_of_diluted_wines': 3.92,
 'proline': 1065.0}

Passing the feature names to the strip function will add a function that maps the features to a list.

naked.strip(model, input_names=dataset.feature_names)
def handle_input_names(x):
    names = ['alcohol', 'malic_acid', 'ash', 'alcalinity_of_ash', 'magnesium', 'total_phenols', 'flavanoids', 'nonflavanoid_phenols', 'proanthocyanins', 'color_intensity', 'hue', 'od280/od315_of_diluted_wines', 'proline']
    return [x[name] for name in names]

def standard_scaler(x):

    mean_ = [13.000617977528083, 2.336348314606741, 2.3665168539325854, 19.49494382022472, 99.74157303370787, 2.295112359550562, 2.0292696629213474, 0.36185393258426973, 1.5908988764044953, 5.058089882022473, 0.9574494382022468, 2.6116853932584254, 746.8932584269663]
    var_ = [0.6553597304633259, 1.241004080924126, 0.07484180027774268, 11.090030614821362, 202.84332786264366, 0.3894890323191514, 0.9921135115515715, 0.015401619113748266, 0.32575424820098453, 5.344255847629093, 0.05195144969069561, 0.5012544628203511, 98609.60096578706]
    with_mean = True
    with_std = True

    def scale(x, m, v):
        if with_mean:
            x -= m
        if with_std:
            x /= v ** .5
        return x

    return [scale(xi, m, v) for xi, m, v in zip(x, mean_, var_)]

def logistic_regression(x):

    coef_ = [[0.8101347947338147, 0.20382073148760085, 0.47221241678911957, -0.8447843882542064, 0.04952904623674445, 0.21372479616642068, 0.6478750705319883, -0.19982499112990385, 0.13833867563545404, 0.17160966151451867, 0.13090887117218597, 0.7259506896985365, 1.07895948707047], [-1.0103233753629153, -0.44045952703036084, -0.8480739967718842, 0.5835732316278703, -0.09770602368275362, 0.027527982220605866, 0.35399157401383297, 0.21278279386396404, 0.2633610495737497, -1.0412707677956505, 0.6825215991118386, 0.05287634940648419, -1.1407929345327175], [0.20018858062910203, 0.23663879554275832, 0.37586157998276365, 0.26121115662633365, 0.048176977446007865, -0.2412527783870254, -1.0018666445458222, -0.012957802734061021, -0.40169972520920566, 0.8696611062811332, -0.8134304702840255, -0.7788270391050198, 0.061833447462247046]]
    intercept_ = [0.41229358315867787, 0.7048164121833935, -1.1171099953420585]

    import math

    logits = [
        b + sum(xi * wi for xi, wi in zip(x, w))
        for w, b in zip(coef_, intercept_)
    ]

    # Sigmoid activation for binary classification
    if len(logits) == 1:
        p_true = 1 / (1 + math.exp(-logits[0]))
        return [1 - p_true, p_true]

    # Softmax activation for multi-class classification
    z_max = max(logits)
    exp = [math.exp(z - z_max) for z in logits]
    exp_sum = sum(exp)
    return [e / exp_sum for e in exp]

def pipeline(x):
    x = handle_input_names(x)
    x = standard_scaler(x)
    x = logistic_regression(x)
    return x

What about output names?

You can also specify the output_names parameter to associate each output value with a name. Of course, this doesn't work for cases where a single value is produced, such as single-target regression.

naked.strip(model, input_names=dataset.feature_names, output_names=dataset.target_names)
def handle_input_names(x):
    names = ['alcohol', 'malic_acid', 'ash', 'alcalinity_of_ash', 'magnesium', 'total_phenols', 'flavanoids', 'nonflavanoid_phenols', 'proanthocyanins', 'color_intensity', 'hue', 'od280/od315_of_diluted_wines', 'proline']
    return [x[name] for name in names]

def standard_scaler(x):

    mean_ = [13.000617977528083, 2.336348314606741, 2.3665168539325854, 19.49494382022472, 99.74157303370787, 2.295112359550562, 2.0292696629213474, 0.36185393258426973, 1.5908988764044953, 5.058089882022473, 0.9574494382022468, 2.6116853932584254, 746.8932584269663]
    var_ = [0.6553597304633259, 1.241004080924126, 0.07484180027774268, 11.090030614821362, 202.84332786264366, 0.3894890323191514, 0.9921135115515715, 0.015401619113748266, 0.32575424820098453, 5.344255847629093, 0.05195144969069561, 0.5012544628203511, 98609.60096578706]
    with_mean = True
    with_std = True

    def scale(x, m, v):
        if with_mean:
            x -= m
        if with_std:
            x /= v ** .5
        return x

    return [scale(xi, m, v) for xi, m, v in zip(x, mean_, var_)]

def logistic_regression(x):

    coef_ = [[0.8101347947338147, 0.20382073148760085, 0.47221241678911957, -0.8447843882542064, 0.04952904623674445, 0.21372479616642068, 0.6478750705319883, -0.19982499112990385, 0.13833867563545404, 0.17160966151451867, 0.13090887117218597, 0.7259506896985365, 1.07895948707047], [-1.0103233753629153, -0.44045952703036084, -0.8480739967718842, 0.5835732316278703, -0.09770602368275362, 0.027527982220605866, 0.35399157401383297, 0.21278279386396404, 0.2633610495737497, -1.0412707677956505, 0.6825215991118386, 0.05287634940648419, -1.1407929345327175], [0.20018858062910203, 0.23663879554275832, 0.37586157998276365, 0.26121115662633365, 0.048176977446007865, -0.2412527783870254, -1.0018666445458222, -0.012957802734061021, -0.40169972520920566, 0.8696611062811332, -0.8134304702840255, -0.7788270391050198, 0.061833447462247046]]
    intercept_ = [0.41229358315867787, 0.7048164121833935, -1.1171099953420585]

    import math

    logits = [
        b + sum(xi * wi for xi, wi in zip(x, w))
        for w, b in zip(coef_, intercept_)
    ]

    # Sigmoid activation for binary classification
    if len(logits) == 1:
        p_true = 1 / (1 + math.exp(-logits[0]))
        return [1 - p_true, p_true]

    # Softmax activation for multi-class classification
    z_max = max(logits)
    exp = [math.exp(z - z_max) for z in logits]
    exp_sum = sum(exp)
    return [e / exp_sum for e in exp]

def handle_output_names(x):
    names = ['class_0' 'class_1' 'class_2']
    return dict(zip(names, x))

def pipeline(x):
    x = handle_input_names(x)
    x = standard_scaler(x)
    x = logistic_regression(x)
    x = handle_output_names(x)
    return x

As you can see, by specifying input_names as well as output_names, we obtain a pipeline of functions which takes as input a dictionary and produces a dictionary.

Development workflow

git clone https://github.com/MaxHalford/naked
cd naked
poetry install
poetry shell
pytest

Things to do

  • Implement more models. For instance it should quite straightforward to support LightGBM.
  • Remove useless branching conditions. Parameters are currently handled via if statements. Ideally it would be nice to remove the if statements and only keep the code that will actually run.

License

MIT

Owner
Max Halford
Data wizard @alan-eu. PhD in machine learning applied to query optimization. Kaggle competitions Master. Online machine learning nut.
Max Halford
Generative Adversarial Networks for High Energy Physics extended to a multi-layer calorimeter simulation

CaloGAN Simulating 3D High Energy Particle Showers in Multi-Layer Electromagnetic Calorimeters with Generative Adversarial Networks. This repository c

Deep Learning for HEP 101 Nov 13, 2022
Irrigation controller for Home Assistant

Irrigation Unlimited This integration is for irrigation systems large and small. It can offer some complex arrangements without large and messy script

Robert Cook 176 Jan 02, 2023
Semantic segmentation models, datasets and losses implemented in PyTorch.

Semantic Segmentation in PyTorch Semantic Segmentation in PyTorch Requirements Main Features Models Datasets Losses Learning rate schedulers Data augm

Yassine 1.3k Jan 07, 2023
SynNet - synthetic tree generation using neural networks

SynNet This repo contains the code and analysis scripts for our amortized approach to synthetic tree generation using neural networks. Our model can s

Wenhao Gao 60 Dec 29, 2022
Kinetics-Data-Preprocessing

Kinetics-Data-Preprocessing Kinetics-400 and Kinetics-600 are common video recognition datasets used by popular video understanding projects like Slow

Kaihua Tang 7 Oct 27, 2022
Mix3D: Out-of-Context Data Augmentation for 3D Scenes (3DV 2021)

Mix3D: Out-of-Context Data Augmentation for 3D Scenes (3DV 2021) Alexey Nekrasov*, Jonas Schult*, Or Litany, Bastian Leibe, Francis Engelmann Mix3D is

Alexey Nekrasov 189 Dec 26, 2022
(IEEE TIP 2021) Regularized Densely-connected Pyramid Network for Salient Instance Segmentation

RDPNet IEEE TIP 2021: Regularized Densely-connected Pyramid Network for Salient Instance Segmentation PyTorch training and testing code are available.

Yu-Huan Wu 41 Oct 21, 2022
VarCLR: Variable Semantic Representation Pre-training via Contrastive Learning

    VarCLR: Variable Representation Pre-training via Contrastive Learning New: Paper accepted by ICSE 2022. Preprint at arXiv! This repository contain

squaresLab 32 Oct 24, 2022
code for the ICLR'22 paper: On Robust Prefix-Tuning for Text Classification

On Robust Prefix-Tuning for Text Classification Prefix-tuning has drawed much attention as it is a parameter-efficient and modular alternative to adap

Zonghan Yang 12 Nov 30, 2022
Official PyTorch implementation of "Physics-aware Difference Graph Networks for Sparsely-Observed Dynamics".

Physics-aware Difference Graph Networks for Sparsely-Observed Dynamics This repository is the official PyTorch implementation of "Physics-aware Differ

USC-Melady 46 Nov 20, 2022
Official code for the CVPR 2021 paper "How Well Do Self-Supervised Models Transfer?"

How Well Do Self-Supervised Models Transfer? This repository hosts the code for the experiments in the CVPR 2021 paper How Well Do Self-Supervised Mod

Linus Ericsson 157 Dec 16, 2022
Unsupervised Real-World Super-Resolution: A Domain Adaptation Perspective

Unofficial pytorch implementation of the paper "Unsupervised Real-World Super-Resolution: A Domain Adaptation Perspective"

16 Nov 21, 2022
Real-time Neural Representation Fusion for Robust Volumetric Mapping

NeuralBlox: Real-Time Neural Representation Fusion for Robust Volumetric Mapping Paper | Supplementary This repository contains the implementation of

ETHZ ASL 106 Dec 24, 2022
UNION: An Unreferenced Metric for Evaluating Open-ended Story Generation

UNION Automatic Evaluation Metric described in the paper UNION: An UNreferenced MetrIc for Evaluating Open-eNded Story Generation (EMNLP 2020). Please

50 Dec 30, 2022
C3d-pytorch - Pytorch porting of C3D network, with Sports1M weights

C3D for pytorch This is a pytorch porting of the network presented in the paper Learning Spatiotemporal Features with 3D Convolutional Networks How to

Davide Abati 311 Jan 06, 2023
Lama-cleaner: Image inpainting tool powered by LaMa

Lama-cleaner: Image inpainting tool powered by LaMa

Qing 5.8k Jan 05, 2023
[NeurIPS 2021]: Are Transformers More Robust Than CNNs? (Pytorch implementation & checkpoints)

Are Transformers More Robust Than CNNs? Pytorch implementation for NeurIPS 2021 Paper: Are Transformers More Robust Than CNNs? Our implementation is b

Yutong Bai 145 Dec 01, 2022
All course materials for the Zero to Mastery Deep Learning with TensorFlow course.

All course materials for the Zero to Mastery Deep Learning with TensorFlow course.

Daniel Bourke 3.4k Jan 07, 2023
A collection of Google research projects related to Federated Learning and Federated Analytics.

Federated Research Federated Research is a collection of research projects related to Federated Learning and Federated Analytics. Federated learning i

Google Research 483 Jan 05, 2023
U-Net: Convolutional Networks for Biomedical Image Segmentation

Deep Learning Tutorial for Kaggle Ultrasound Nerve Segmentation competition, using Keras This tutorial shows how to use Keras library to build deep ne

Yihui He 401 Nov 21, 2022