Fastshap: A fast, approximate shap kernel

Related tags

Deep Learningfastshap
Overview

fastshap: A fast, approximate shap kernel

fastshap was designed to be:

  • Fast Calculating shap values can take an extremely long time. fastshap utilizes inner and outer batch assignments to keep the calculations inside vectorized operations as often as it can.
  • Used on Tabular Data Can accept numpy arrays or pandas DataFrames, and can handle categorical variables natively. As of right now, only 1 dimensional outputs are accepted.

WARNING This package specifically offers a kernel explainer, which can calculate approximate shap values of f(X) towards y for any function f. Much faster shap solutions are available specifically for gradient boosted trees.

Installation

This package can be installed using either pip or conda, through conda-forge:

# Using pip
$ pip install fastshap --no-cache-dir

You can also download the latest development version from this repository. If you want to install from github with conda, you must first run conda install pip git.

$ pip install git+https://github.com/AnotherSamWilson/fastshap.git

Basic Usage

We will use the iris dataset for this example. Here, we load the data and train a simple lightgbm model on the dataset:

from sklearn.datasets import load_iris
import pandas as pd
import lightgbm as lgb
import numpy as np

# Define our dataset and target variable
data = pd.concat(load_iris(as_frame=True,return_X_y=True),axis=1)
data.rename({"target": "species"}, inplace=True, axis=1)
data["species"] = data["species"].astype("category")
target = data.pop("sepal length (cm)")

# Train our model
dtrain = lgb.Dataset(data=data, label=target)
lgbmodel = lgb.train(
    params={"seed": 1, "verbose": -1},
    train_set=dtrain,
    num_boost_round=10
)

# Define the function we wish to build shap values for.
model = lgbmodel.predict

preds = model(data)

We now have a model which takes a Pandas dataframe, and returns predictions. We can create an explainer that will use data as a background dataset to calculate the shap values of any dataset we wish:

import fastshap

ke = fastshap.KernelExplainer(model, data)
sv = ke.calculate_shap_values(data, verbose=False)

print(all(preds == sv.sum(1)))
## True

Stratifying the Background Set

We can select a subset of our data to act as a background set. By stratifying the background set on the results of the model output, we will usually get very similar results, while decreasing the caculation time drastically.

ke.stratify_background_set(5)
sv2 = ke.calculate_shap_values(
  data, 
  background_fold_to_use=0,
  verbose=False
)

print(np.abs(sv2 - sv).mean(0))
## [1.74764532e-03 1.61829094e-02 1.99534408e-03 4.02640884e-16
##  1.71084747e-02]

What we did is break up our background set into 10 different sets, stratified by the model output. We then used the first of these sets as our background set. We then compared the average difference between these shap values, and the shap values we obtained from using the entire dataset.

Choosing Batch Sizes

If the entire process was vectorized, it would require an array of size (# Samples * # Coalitions * # Background samples, # Columns). Where # Coalitions is the sum of the total number of coalitions that are going to be run. Even for small datasets, this becomes enormous. fastshap breaks this array up into chunks by splitting the process into a series of batches.

This is a list of the large arrays and their maximum size:

  • Global
    • Mask Matrix (# Coalitions, # Columns) dtype = int8
  • Outer Batch
    • Linear Targets (Total Coalition Combinations, Outer Batch Size) dtype = adaptive
  • Inner Batch
    • Model Evaluation Features (Inner Batch Size, # background samples) dtype = adaptive

The adaptive datatypes of the arrays above will be matched to the data types of the model output. Therefore, if your model returns float32, these arrays will be stored as float32. The final, returned shap values will also be returned as the datatype returned by the model.

These theoretical sizes can be calculated directly so that the user can determine appropriate batch sizes for their machine:

# Combines our background data back into 1 DataFrame
ke.stratify_background_set(1)
(
    mask_matrix_size, 
    linear_target_size, 
    inner_model_eval_set_size
) = ke.get_theoretical_array_expansion_sizes(
    outer_batch_size=150,
    inner_batch_size=150,
    n_coalition_sizes=3,
    background_fold_to_use=None,
)

print(
  np.product(linear_target_size) + np.product(inner_model_eval_set_size)
)
## 92100

For the iris dataset, even if we sent the entire set (150 rows) through as one batch, we only need 92100 elements stored in arrays. This is manageable on most machines. However, this number grows extremely quickly with the samples and number of columns. It is highly advised to determine a good batch scheme before running this process.

Specifying a Custom Linear Model

Any linear model available from sklearn.linear_model can be used to calculate the shap values. If you wish for some sparsity in the shap values, you can use Lasso regression:

from sklearn.linear_model import Lasso

# Use our entire background set
ke.stratify_background_set(1)
sv_lasso = ke.calculate_shap_values(
  data, 
  background_fold_to_use=0,
  linear_model=Lasso(alpha=0.1),
  verbose=False
)

print(sv_lasso[0,:])
## [-0.         -0.33797832 -0.         -0.14634971  5.84333333]

The default model used is sklearn.linear_model.LinearRegression.

Owner
Samuel Wilson
Samuel Wilson
Source for the paper "Universal Activation Function for machine learning"

Universal Activation Function Tensorflow and Pytorch source code for the paper Yuen, Brosnan, Minh Tu Hoang, Xiaodai Dong, and Tao Lu. "Universal acti

4 Dec 03, 2022
KE-Dialogue: Injecting knowledge graph into a fully end-to-end dialogue system.

Learning Knowledge Bases with Parameters for Task-Oriented Dialogue Systems This is the implementation of the paper: Learning Knowledge Bases with Par

CAiRE 42 Nov 10, 2022
VarCLR: Variable Semantic Representation Pre-training via Contrastive Learning

    VarCLR: Variable Representation Pre-training via Contrastive Learning New: Paper accepted by ICSE 2022. Preprint at arXiv! This repository contain

squaresLab 32 Oct 24, 2022
[ICLR2021oral] Rethinking Architecture Selection in Differentiable NAS

DARTS-PT Code accompanying the paper ICLR'2021: Rethinking Architecture Selection in Differentiable NAS Ruochen Wang, Minhao Cheng, Xiangning Chen, Xi

Ruochen Wang 86 Dec 27, 2022
Async API for controlling Hue Lights

Hue API Async API for controlling Hue Lights Documentation: hue-api.nirantak.com Source: github.com/nirantak/hue-api Installation This is an async cli

Nirantak Raghav 4 Nov 16, 2022
This is the official implement of paper "ActionCLIP: A New Paradigm for Action Recognition"

This is an official pytorch implementation of ActionCLIP: A New Paradigm for Video Action Recognition [arXiv] Overview Content Prerequisites Data Prep

268 Jan 09, 2023
Our implementation used for the MICCAI 2021 FLARE Challenge titled 'Efficient Multi-Organ Segmentation Using SpatialConfiguartion-Net with Low GPU Memory Requirements'.

Efficient Multi-Organ Segmentation Using SpatialConfiguartion-Net with Low GPU Memory Requirements Our implementation used for the MICCAI 2021 FLARE C

Franz Thaler 3 Sep 27, 2022
This repository contains code demonstrating the methods outlined in Path Signature Area-Based Causal Discovery in Coupled Time Series presented at Causal Analysis Workshop 2021.

signed-area-causal-inference This repository contains code demonstrating the methods outlined in Path Signature Area-Based Causal Discovery in Coupled

Will Glad 1 Mar 11, 2022
Adversarial Adaptation with Distillation for BERT Unsupervised Domain Adaptation

Knowledge Distillation for BERT Unsupervised Domain Adaptation Official PyTorch implementation | Paper Abstract A pre-trained language model, BERT, ha

Minho Ryu 29 Nov 30, 2022
SPCL: A New Framework for Domain Adaptive Semantic Segmentation via Semantic Prototype-based Contrastive Learning

SPCL SPCL: A New Framework for Domain Adaptive Semantic Segmentation via Semantic Prototype-based Contrastive Learning Update on 2021/11/25: ArXiv Ver

Binhui Xie (谢斌辉) 11 Oct 29, 2022
[ICCV 2021 Oral] Mining Latent Classes for Few-shot Segmentation

Mining Latent Classes for Few-shot Segmentation Lihe Yang, Wei Zhuo, Lei Qi, Yinghuan Shi, Yang Gao. This codebase contains baseline of our paper Mini

Lihe Yang 66 Nov 29, 2022
Greedy Gaussian Segmentation

GGS Greedy Gaussian Segmentation (GGS) is a Python solver for efficiently segmenting multivariate time series data. For implementation details, please

Stanford University Convex Optimization Group 72 Dec 07, 2022
Explaining Deep Neural Networks - A comparison of different CAM methods based on an insect data set

Explaining Deep Neural Networks - A comparison of different CAM methods based on an insect data set This is the repository for the Deep Learning proje

Robert Krug 3 Feb 06, 2022
RoIAlign & crop_and_resize for PyTorch

RoIAlign for PyTorch This is a PyTorch version of RoIAlign. This implementation is based on crop_and_resize and supports both forward and backward on

Long Chen 530 Jan 07, 2023
Semi-supervised learning for object detection

Source code for STAC: A Simple Semi-Supervised Learning Framework for Object Detection STAC is a simple yet effective SSL framework for visual object

Google Research 348 Dec 25, 2022
HeatNet is a python package that provides tools to build, train and evaluate neural networks designed to predict extreme heat wave events globally on daily to subseasonal timescales.

HeatNet HeatNet is a python package that provides tools to build, train and evaluate neural networks designed to predict extreme heat wave events glob

Google Research 6 Jul 07, 2022
Offline Reinforcement Learning with Implicit Q-Learning

Offline Reinforcement Learning with Implicit Q-Learning This repository contains the official implementation of Offline Reinforcement Learning with Im

Ilya Kostrikov 126 Jan 06, 2023
This repo is the official implementation for Multi-Scale Adaptive Graph Neural Network for Multivariate Time Series Forecasting

1 MAGNN This repo is the official implementation for Multi-Scale Adaptive Graph Neural Network for Multivariate Time Series Forecasting. 1.1 The frame

SZJ 12 Nov 08, 2022
Code for "Multi-Compound Transformer for Accurate Biomedical Image Segmentation"

News The code of MCTrans has been released. if you are interested in contributing to the standardization of the medical image analysis community, plea

97 Jan 05, 2023
Official PyTorch implementation of Learning Intra-Batch Connections for Deep Metric Learning (ICML 2021) published at International Conference on Machine Learning

About This repository the official PyTorch implementation of Learning Intra-Batch Connections for Deep Metric Learning. The config files contain the s

Dynamic Vision and Learning Group 41 Dec 10, 2022