Explainer for black box models that predict molecule properties

Related tags

Deep Learningexmol
Overview

Explaining why that molecule

GitHub tests paper docs PyPI version MIT license

exmol is a package to explain black-box predictions of molecules. The package uses model agnostic explanations to help users understand why a molecule is predicted to have a property.

Install

pip install exmol

Counterfactual Generation

Our package implements the Model Agnostic Counterfactual Compounds with STONED (MACCS) to generate counterfactuals. A counterfactual can explain a prediction by showing what would have to change in the molecule to change its predicted class. Here is an eample of a counterfactual:

This package is not popular. If the package had a logo, it would be popular.

In addition to having a changed prediction, a molecular counterfactual must be similar to its base molecule as much as possible. Here is an example of a molecular counterfactual:

counterfactual demo

The counterfactual shows that if the carboxylic acid were an ester, the molecule would be active. It is up to the user to translate this set of structures into a meaningful sentence.

Usage

Let's assume you have a deep learning model my_model(s) that takes in one SMILES string and outputs a predicted binary class. To generate counterfactuals, we need to wrap our function so that it can take both SMILES and SELFIES, but it only needs to use one.

We first expand chemical space around the prediction of interest

import exmol

# mol of interest
base = 'CCCO'

samples = exmol.sample_space(base, lambda smi, sel: my_model(smi), batched=False)

Here we use a lambda to wrap our function and indicate our function can only take one SMILES string, not a list of them with batched=False. Now we select counterfactuals from that space and plot them.

cfs = exmol.cf_explain(samples)
exmol.plot_cf(cfs)

set of counterfactuals

We can also plot the space around the counterfactual. This is computed via PCA of the affinity matrix -- the similarity with the base molecule. Due to how similarity is calculated, the base is going to be the farthest from all other molecules. Thus your base should fall on the left (or right) extreme of your plot.

cfs = exmol.cf_explain(samples)
exmol.plot_space(samples, cfs)

chemical space

Each counterfactual is a Python dataclass with information allowing it to be used in your own analysis:

print(cfs[0])
Examples(
  smiles='CCOC(=O)c1ccc(N=CN(Cl)c2ccccc2)cc1',
  selfies='[C][C][O][C][Branch1_2][C][=O][C][=C][C][=C][Branch1_1][#C][N][=C][N][Branch1_1][C][Cl][C][=C][C][=C][C][=C][Ring1][Branch1_2][C][=C][Ring1][S]',
  similarity=0.8181818181818182,
  yhat=-5.459493637084961,
  index=1807,
  position=array([-6.11371691,  1.24629293]),
  is_origin=False,
  cluster=26,
  label='Counterfactual')

Chemical Space

When calling exmol.sample_space you can pass preset=<preset>, which can be one of the following:

  • 'narrow': Only one change to molecular structure, reduced set of possible bonds/elements
  • 'medium': Default. One or two changes to molecular structure, reduced set of possible bonds/elements
  • 'wide': One through five changes to molecular structure, large set of possible bonds/elements
  • 'chemed': A restrictive set where only pubchem molecules are considered. Experimental

You can also pass num_samples as a "request" for number of samples. You will typically end up with less due to degenerate molecules. See API for complete description.

SVG

Molecules are by default drawn as PNGs. If you would like to have them drawn as SVGs, call insert_svg after calling plot_space or plot_cf

import skunk
exmol.plot_cf(exps)
svg = exmol.insert_svg(exps, mol_fontsize=16)

# for Jupyter Notebook
skunk.display(svg)

# To save to file
with open('myplot.svg', 'w') as f:
    f.write(svg)

This is done with the skunk 🦨 library.

API and Docs

Read API here. You should also read the paper (see below) for a more exact description of the methods and implementation.

Citation

Please cite Wellawatte et al.

 @article{wellawatte_seshadri_white_2021,
 place={Cambridge},
 title={Model agnostic generation of counterfactual explanations for molecules},
 DOI={10.33774/chemrxiv-2021-4qkg8},
 journal={ChemRxiv},
 publisher={Cambridge Open Engage},
 author={Wellawatte, Geemi P and Seshadri, Aditi and White, Andrew D},
 year={2021}}

This content is a preprint and has not been peer-reviewed.

Comments
  • Add LIME explanations

    Add LIME explanations

    This is a big PR!

    • [x] Document LIME function
    • [x] Compute t-stats using examples that have non-zero weights
    • [x] Add plotting code for descriptors - needs SMARTS annotations for MACCS keys (166 files)
    • [x] Add plotting code for chemical space and fit
    • [x] Description in readme
    • [x] Clean up notebooks and add documentation
    • [x] Remove extra files
    • [x] Add LIME notebooks to CI?
    opened by hgandhi2411 11
  • Error while plotting counterfactuals using plot_cf()

    Error while plotting counterfactuals using plot_cf()

    plot_cf() function errors out with the following error. This behavior is also consistent across all notebooks in paper/.

    ---------------------------------------------------------------------------
    TypeError                                 Traceback (most recent call last)
    <ipython-input-10-b6c8ed26216e> in <module>
          1 fkw = {"figsize": (8, 6)}
          2 mpl.rc("axes", titlesize=12)
    ----> 3 exmol.plot_cf(exps, figure_kwargs=fkw, mol_size=(450, 400), nrows=1)
          4 
          5 plt.savefig("rf-simple.png", dpi=180)
    
    /gpfs/fs2/scratch/hgandhi/exmol/exmol/exmol.py in plot_cf(exps, fig, figure_kwargs, mol_size, mol_fontsize, nrows, ncols)
        682         title += f"\nf(x) = {e.yhat:.3f}"
        683         axs[i].set_title(title)
    --> 684         axs[i].imshow(np.asarray(img), gid=f"rdkit-img-{i}")
        685         axs[i].axis("off")
        686     for j in range(i, C * R):
    
    ~/.local/lib/python3.7/site-packages/matplotlib/__init__.py in inner(ax, data, *args, **kwargs)
       1359     def inner(ax, *args, data=None, **kwargs):
       1360         if data is None:
    -> 1361             return func(ax, *map(sanitize_sequence, args), **kwargs)
       1362 
       1363         bound = new_sig.bind(ax, *args, **kwargs)
    
    ~/.local/lib/python3.7/site-packages/matplotlib/axes/_axes.py in imshow(self, X, cmap, norm, aspect, interpolation, alpha, vmin, vmax, origin, extent, filternorm, filterrad, resample, url, **kwargs)
       5607                               resample=resample, **kwargs)
       5608 
    -> 5609         im.set_data(X)
       5610         im.set_alpha(alpha)
       5611         if im.get_clip_path() is None:
    
    ~/.local/lib/python3.7/site-packages/matplotlib/image.py in set_data(self, A)
        699                 not np.can_cast(self._A.dtype, float, "same_kind")):
        700             raise TypeError("Image data of dtype {} cannot be converted to "
    --> 701                             "float".format(self._A.dtype))
        702 
        703         if self._A.ndim == 3 and self._A.shape[-1] == 1:
    
    TypeError: Image data of dtype <U14622 cannot be converted to float
    
    opened by hgandhi2411 6
  • Error after installation

    Error after installation

    Hi,

    First at all, thank you for your work!. I am obtaining a problem installing your library, o better say when I do "import exmol", I obtaing one error:"No module named 'dataclasses'".

    I have installed as: pip install exmol...

    Thanks!

    opened by PARODBE 6
  • CODEX Example

    CODEX Example

    While messing around with CODEX, I noticed it wants to compute ECFP4 fingerprints using a different method and this gives slightly different similarities. @geemi725 could you double-check the ECFP4 implementation we have is correct, or is the CODEX one correct?

    image

    opened by whitead 6
  • Object has no attribute '__code__'

    Object has no attribute '__code__'

    Hi there, I noticed that sample_space does not seem to work with class instances, because they do not have a __code__ attribute:

    import exmol
    class A:
        pass
    exmol.sample_space('C', A(), batched=True)
    
    AttributeError: 'A' object has no attribute '__code__'
    

    Is there any way around this other than forcing the call to a separate function?

    opened by oiao 5
  • The module 'exmol' has no attribute 'lime_explain'

    The module 'exmol' has no attribute 'lime_explain'

    In the notebook RF-lime.ipynb, the command

    exmol.lime_explain(space, descriptor_type=descriptor_type)

    gives a error module 'exmol' has no attribute 'lime_explain'

    Please, let me know how to fix this error. Thanks.

    opened by andresilvapimentel 5
  • Easier usage of explain

    Easier usage of explain

    Working through some examples, I've noted the following things:

    1. Descriptor type should have a default - maybe MACCS since the plots will show-up
    2. Maybe we should only save SVGs, rather than return unless prompted
    3. We should do string comparison for descriptor types using lowercase strings, so that classic and Classic and ecfp are valid.
    4. We probably shouldn't save without a filename - it is unexpected
    opened by whitead 4
  • Allow using custom list of molecules

    Allow using custom list of molecules

    Hello @whitead, this is very nice package !

    I found the new chemed option very useful and thought extending it to any list of molecule would make sense.

    Here is the main change to the API:

    explanation = exmol.sample_space(
          "CCCC",
          model,
          preset="custom", #use custom preset
          batched=False,
          data=data, # provide list of smiles or molecules
    )
    

    Let me know if this PR make sense.

    opened by maclandrol 4
  • Target molecule frequently on the edge of sample space visualization

    Target molecule frequently on the edge of sample space visualization

    In your example provided in the code, the target molecule is on the edge of the sampled distribution (in the PCA plot). I also find this happens very frequently with my experiments on my model. I think this suggests that the sampling produces molecules that are not evenly distributed around the target. I just want to verify that this is a property of the STONED sampling algorithm, and not an artifact of the visualization code (which it does not seem to be). I've attached an example of my own, for both "narrow" and "medium" presets.

    preset="narrow", nmols=10

    explain_narrow_0 05_10

    preset="medium", nmols=10

    explain_medium_0 05_10

    opened by adamoyoung 3
  • Sanitizing SMILES removes chirality information

    Sanitizing SMILES removes chirality information

    On this line of sample_space(), chirality information of origin_smiles is removed. The output is then unsuitable as input to a chirality-aware ML model, e.g. to distinguish L vs. D amino acids which are important in models of binding affinity. Could the option to skip this sanitization step be provided to the user?

    PS: Great code base and beautiful visualizations! We're finding it very useful in explaining our Gaussian Process models. The future of SAR ←→ ML looks exciting.

    opened by tianyu-lu 2
  • Release 0.5.0 on pypi

    Release 0.5.0 on pypi

    Are you planning to release 0.5.0 on pypi? I am maintaining the conda package of exmol and I would like to bump it to 0.5.0. See https://github.com/conda-forge/exmol-feedstock

    Thanks!

    opened by hadim 2
  • run_STONED couldn't generate SMILES after 30 minutes

    run_STONED couldn't generate SMILES after 30 minutes

    For certain SMILES, run_STONED() failed to generate after running for so long. So far, one SMILES known to cause such issue is

    [Na+].[Na+].[Na+].[Na+].[Na+].[O-][S](=O)(=O)OCC[S](=O)(=O)c1cccc(Nc2nc(Cl)nc(Nc3cc(cc4C=C(\C(=N/Nc5ccc6c(cccc6[S]([O-])(=O)=O)c5[S]([O-])(=O)=O)C(=O)c34)[S]([O-])(=O)=O)[S]([O-])(=O)=O)n2)c1

    Here is how I use the function: exmol.run_stoned(smiles, num_samples=10, max_mutations=1).

    opened by qcampbel 2
Releases(v2.2.1)
Developed an optimized algorithm which finds the most optimal path between 2 points in a 3D Maze using various AI search techniques like BFS, DFS, UCS, Greedy BFS and A*

Developed an optimized algorithm which finds the most optimal path between 2 points in a 3D Maze using various AI search techniques like BFS, DFS, UCS, Greedy BFS and A*. The algorithm was extremely

1 Mar 28, 2022
Code repository for Self-supervised Structure-sensitive Learning, CVPR'17

Self-supervised Structure-sensitive Learning (SSL) Ke Gong, Xiaodan Liang, Xiaohui Shen, Liang Lin, "Look into Person: Self-supervised Structure-sensi

Clay Gong 219 Dec 29, 2022
HAR-stacked-residual-bidir-LSTMs - Deep stacked residual bidirectional LSTMs for HAR

HAR-stacked-residual-bidir-LSTM The project is based on this repository which is presented as a tutorial. It consists of Human Activity Recognition (H

Guillaume Chevalier 287 Dec 27, 2022
GNEE - GAT Neural Event Embeddings

GNEE - GAT Neural Event Embeddings This repository contains source code for the GNEE (GAT Neural Event Embeddings) method introduced in the paper: "Se

João Pedro Rodrigues Mattos 0 Sep 15, 2021
PAIRED in PyTorch 🔥

PAIRED This codebase provides a PyTorch implementation of Protagonist Antagonist Induced Regret Environment Design (PAIRED), which was first introduce

UCL DARK Lab 46 Dec 12, 2022
Tilted Empirical Risk Minimization (ICLR '21)

Tilted Empirical Risk Minimization This repository contains the implementation for the paper Tilted Empirical Risk Minimization ICLR 2021 Empirical ri

Tian Li 40 Nov 28, 2022
Official PyTorch implementation of the paper "TEMOS: Generating diverse human motions from textual descriptions"

TEMOS: TExt to MOtionS Generating diverse human motions from textual descriptions Description Official PyTorch implementation of the paper "TEMOS: Gen

Mathis Petrovich 187 Dec 27, 2022
Fast and accurate optimisation for registration with little learningconvexadam

convexAdam Learn2Reg 2021 Submission Fast and accurate optimisation for registration with little learning Excellent results on Learn2Reg 2021 challeng

17 Dec 06, 2022
Multi-modal Vision Transformers Excel at Class-agnostic Object Detection

Multi-modal Vision Transformers Excel at Class-agnostic Object Detection

Muhammad Maaz 206 Jan 04, 2023
Final project for Intro to CS class.

Financial Analysis Web App https://share.streamlit.io/mayurk1/fin-web-app-final-project/webApp.py 1. Project Description This project is a technical a

Mayur Khanna 1 Dec 10, 2021
Implementation for Simple Spectral Graph Convolution in ICLR 2021

Simple Spectral Graph Convolutional Overview This repo contains an example implementation of the Simple Spectral Graph Convolutional (S^2GC) model. Th

allenhaozhu 64 Dec 31, 2022
QMagFace: Simple and Accurate Quality-Aware Face Recognition

Quality-Aware Face Recognition 26.11.2021 start readme QMagFace: Simple and Accurate Quality-Aware Face Recognition Research Paper Implementation - To

Philipp Terhörst 59 Jan 04, 2023
An MQA (Studio, originalSampleRate) identifier for lossless flac files written in Python.

An MQA (Studio, originalSampleRate) identifier for "lossless" flac files written in Python.

Daniel 10 Oct 03, 2022
Pacman-AI - AI project designed by UC Berkeley. Designed reflex and minimax agents for the game Pacman.

Pacman AI Jussi Doherty CAP 4601 - Introduction to Artificial Intelligence - Fall 2020 Python version 3.0+ Source of this project This repo contains a

Jussi Doherty 1 Jan 03, 2022
This git repo contains the implementation of my ML project on Heart Disease Prediction

Introduction This git repo contains the implementation of my ML project on Heart Disease Prediction. This is a real-world machine learning model/proje

Aryan Dutta 1 Feb 02, 2022
This is the code for ACL2021 paper A Unified Generative Framework for Aspect-Based Sentiment Analysis

This is the code for ACL2021 paper A Unified Generative Framework for Aspect-Based Sentiment Analysis Install the package in the requirements.txt, the

108 Dec 23, 2022
This is the official github repository of the Met dataset

The Met dataset This is the official github repository of the Met dataset. The official webpage of the dataset can be found here. What is it? This cod

Nikolaos-Antonios Ypsilantis 35 Dec 17, 2022
Learning Calibrated-Guidance for Object Detection in Aerial Images

Learning Calibrated-Guidance for Object Detection in Aerial Images arxiv We propose a simple yet effective Calibrated-Guidance (CG) scheme to enhance

51 Sep 22, 2022
Efficiently Disentangle Causal Representations

Efficiently Disentangle Causal Representations Install dependency pip install -r requirements.txt Main experiments Causality direction prediction cd

4 Apr 01, 2022
Example Of Fine-Tuning BERT For Named-Entity Recognition Task And Preparing For Cloud Deployment Using Flask, React, And Docker

Example Of Fine-Tuning BERT For Named-Entity Recognition Task And Preparing For Cloud Deployment Using Flask, React, And Docker This repository contai

Nikita 12 Dec 14, 2022