Molecular Sets (MOSES): A benchmarking platform for molecular generation models

Overview

Molecular Sets (MOSES): A benchmarking platform for molecular generation models

Build Status PyPI version

Deep generative models are rapidly becoming popular for the discovery of new molecules and materials. Such models learn on a large collection of molecular structures and produce novel compounds. In this work, we introduce Molecular Sets (MOSES), a benchmarking platform to support research on machine learning for drug discovery. MOSES implements several popular molecular generation models and provides a set of metrics to evaluate the quality and diversity of generated molecules. With MOSES, we aim to standardize the research on molecular generation and facilitate the sharing and comparison of new models.

For more details, please refer to the paper.

If you are using MOSES in your research paper, please cite us as

@article{10.3389/fphar.2020.565644,
  title={{M}olecular {S}ets ({MOSES}): {A} {B}enchmarking {P}latform for {M}olecular {G}eneration {M}odels},
  author={Polykovskiy, Daniil and Zhebrak, Alexander and Sanchez-Lengeling, Benjamin and Golovanov, Sergey and Tatanov, Oktai and Belyaev, Stanislav and Kurbanov, Rauf and Artamonov, Aleksey and Aladinskiy, Vladimir and Veselov, Mark and Kadurin, Artur and Johansson, Simon and  Chen, Hongming and Nikolenko, Sergey and Aspuru-Guzik, Alan and Zhavoronkov, Alex},
  journal={Frontiers in Pharmacology},
  year={2020}
}

pipeline

Dataset

We propose a benchmarking dataset refined from the ZINC database.

The set is based on the ZINC Clean Leads collection. It contains 4,591,276 molecules in total, filtered by molecular weight in the range from 250 to 350 Daltons, a number of rotatable bonds not greater than 7, and XlogP less than or equal to 3.5. We removed molecules containing charged atoms or atoms besides C, N, S, O, F, Cl, Br, H or cycles longer than 8 atoms. The molecules were filtered via medicinal chemistry filters (MCFs) and PAINS filters.

The dataset contains 1,936,962 molecular structures. For experiments, we split the dataset into a training, test and scaffold test sets containing around 1.6M, 176k, and 176k molecules respectively. The scaffold test set contains unique Bemis-Murcko scaffolds that were not present in the training and test sets. We use this set to assess how well the model can generate previously unobserved scaffolds.

Models

Metrics

Besides standard uniqueness and validity metrics, MOSES provides other metrics to access the overall quality of generated molecules. Fragment similarity (Frag) and Scaffold similarity (Scaff) are cosine distances between vectors of fragment or scaffold frequencies correspondingly of the generated and test sets. Nearest neighbor similarity (SNN) is the average similarity of generated molecules to the nearest molecule from the test set. Internal diversity (IntDiv) is an average pairwise similarity of generated molecules. Fréchet ChemNet Distance (FCD) measures the difference in distributions of last layer activations of ChemNet. Novelty is a fraction of unique valid generated molecules not present in the training set.

Model Valid (↑) [email protected] (↑) [email protected] (↑) FCD (↓) SNN (↑) Frag (↑) Scaf (↑) IntDiv (↑) IntDiv2 (↑) Filters (↑) Novelty (↑)
Test TestSF Test TestSF Test TestSF Test TestSF
Train 1.0 1.0 1.0 0.008 0.4755 0.6419 0.5859 1.0 0.9986 0.9907 0.0 0.8567 0.8508 1.0 1.0
HMM 0.076±0.0322 0.623±0.1224 0.5671±0.1424 24.4661±2.5251 25.4312±2.5599 0.3876±0.0107 0.3795±0.0107 0.5754±0.1224 0.5681±0.1218 0.2065±0.0481 0.049±0.018 0.8466±0.0403 0.8104±0.0507 0.9024±0.0489 0.9994±0.001
NGram 0.2376±0.0025 0.974±0.0108 0.9217±0.0019 5.5069±0.1027 6.2306±0.0966 0.5209±0.001 0.4997±0.0005 0.9846±0.0012 0.9815±0.0012 0.5302±0.0163 0.0977±0.0142 0.8738±0.0002 0.8644±0.0002 0.9582±0.001 0.9694±0.001
Combinatorial 1.0±0.0 0.9983±0.0015 0.9909±0.0009 4.2375±0.037 4.5113±0.0274 0.4514±0.0003 0.4388±0.0002 0.9912±0.0004 0.9904±0.0003 0.4445±0.0056 0.0865±0.0027 0.8732±0.0002 0.8666±0.0002 0.9557±0.0018 0.9878±0.0008
CharRNN 0.9748±0.0264 1.0±0.0 0.9994±0.0003 0.0732±0.0247 0.5204±0.0379 0.6015±0.0206 0.5649±0.0142 0.9998±0.0002 0.9983±0.0003 0.9242±0.0058 0.1101±0.0081 0.8562±0.0005 0.8503±0.0005 0.9943±0.0034 0.8419±0.0509
AAE 0.9368±0.0341 1.0±0.0 0.9973±0.002 0.5555±0.2033 1.0572±0.2375 0.6081±0.0043 0.5677±0.0045 0.991±0.0051 0.9905±0.0039 0.9022±0.0375 0.0789±0.009 0.8557±0.0031 0.8499±0.003 0.996±0.0006 0.7931±0.0285
VAE 0.9767±0.0012 1.0±0.0 0.9984±0.0005 0.099±0.0125 0.567±0.0338 0.6257±0.0005 0.5783±0.0008 0.9994±0.0001 0.9984±0.0003 0.9386±0.0021 0.0588±0.0095 0.8558±0.0004 0.8498±0.0004 0.997±0.0002 0.6949±0.0069
JTN-VAE 1.0±0.0 1.0±0.0 0.9996±0.0003 0.3954±0.0234 0.9382±0.0531 0.5477±0.0076 0.5194±0.007 0.9965±0.0003 0.9947±0.0002 0.8964±0.0039 0.1009±0.0105 0.8551±0.0034 0.8493±0.0035 0.976±0.0016 0.9143±0.0058
LatentGAN 0.8966±0.0029 1.0±0.0 0.9968±0.0002 0.2968±0.0087 0.8281±0.0117 0.5371±0.0004 0.5132±0.0002 0.9986±0.0004 0.9972±0.0007 0.8867±0.0009 0.1072±0.0098 0.8565±0.0007 0.8505±0.0006 0.9735±0.0006 0.9498±0.0006

For comparison of molecular properties, we computed the Wasserstein-1 distance between distributions of molecules in the generated and test sets. Below, we provide plots for lipophilicity (logP), Synthetic Accessibility (SA), Quantitative Estimation of Drug-likeness (QED) and molecular weight.

logP SA
logP SA
weight QED
weight QED

Installation

PyPi

The simplest way to install MOSES (models and metrics) is to install RDKit: conda install -yq -c rdkit rdkit and then install MOSES (molsets) from pip (pip install molsets). If you want to use LatentGAN, you should also install additional dependencies using bash install_latentgan_dependencies.sh.

If you are using Ubuntu, you should also install sudo apt-get install libxrender1 libxext6 for RDKit.

Docker

  1. Install docker and nvidia-docker.

  2. Pull an existing image (4.1Gb to download) from DockerHub:

docker pull molecularsets/moses

or clone the repository and build it manually:

git clone https://github.com/molecularsets/moses.git
nvidia-docker image build --tag molecularsets/moses moses/
  1. Create a container:
nvidia-docker run -it --name moses --network="host" --shm-size 10G molecularsets/moses
  1. The dataset and source code are available inside the docker container at /moses:
docker exec -it molecularsets/moses bash

Manually

Alternatively, install dependencies and MOSES manually.

  1. Clone the repository:
git lfs install
git clone https://github.com/molecularsets/moses.git
  1. Install RDKit for metrics calculation.

  2. Install MOSES:

python setup.py install
  1. (Optional) Install dependencies for LatentGAN:
bash install_latentgan_dependencies.sh

Benchmarking your models

  • Install MOSES as described in the previous section.

  • Get train, test and test_scaffolds datasets using the following code:

import moses

train = moses.get_dataset('train')
test = moses.get_dataset('test')
test_scaffolds = moses.get_dataset('test_scaffolds')
  • You can use a standard torch DataLoader in your models. We provide a simple StringDataset class for convenience:
from torch.utils.data import DataLoader
from moses import CharVocab, StringDataset

train = moses.get_dataset('train')
vocab = CharVocab.from_data(train)
train_dataset = StringDataset(vocab, train)
train_dataloader = DataLoader(
    train_dataset, batch_size=512,
    shuffle=True, collate_fn=train_dataset.default_collate
)

for with_bos, with_eos, lengths in train_dataloader:
    ...
  • Calculate metrics from your model's samples. We recomend sampling at least 30,000 molecules:
import moses
metrics = moses.get_all_metrics(list_of_generated_smiles)
  • Add generated samples and metrics to your repository. Run the experiment multiple times to estimate the variance of the metrics.

Reproducing the baselines

End-to-End launch

You can run pretty much everything with:

python scripts/run.py

This will split the dataset, train the models, generate new molecules, and calculate the metrics. Evaluation results will be saved in metrics.csv.

You can specify the GPU device index as cuda:n (or cpu for CPU) and/or model by running:

python scripts/run.py --device cuda:1 --model aae

For more details run python scripts/run.py --help.

You can reproduce evaluation of all models with several seeds by running:

sh scripts/run_all_models.sh

Training

python scripts/train.py <model name> \
       --train_load <train dataset> \
       --model_save <path to model> \
       --config_save <path to config> \
       --vocab_save <path to vocabulary>

To get a list of supported models run python scripts/train.py --help.

For more details of certain model run python scripts/train.py --help .

Generation

python scripts/sample.py <model name> \
       --model_load <path to model> \
       --vocab_load <path to vocabulary> \
       --config_load <path to config> \
       --n_samples <number of samples> \
       --gen_save <path to generated dataset>

To get a list of supported models run python scripts/sample.py --help.

For more details of certain model run python scripts/sample.py --help .

Evaluation

python scripts/eval.py \
       --ref_path <reference dataset> \
       --gen_path <generated dataset>

For more details run python scripts/eval.py --help.

Owner
Neelesh C A
Neelesh C A
A PyTorch Implementation of Single Shot MultiBox Detector

SSD: Single Shot MultiBox Object Detector, in PyTorch A PyTorch implementation of Single Shot MultiBox Detector from the 2016 paper by Wei Liu, Dragom

Max deGroot 4.8k Jan 07, 2023
PyTorch version of the paper 'Enhanced Deep Residual Networks for Single Image Super-Resolution' (CVPRW 2017)

About PyTorch 1.2.0 Now the master branch supports PyTorch 1.2.0 by default. Due to the serious version problem (especially torch.utils.data.dataloade

Sanghyun Son 2.1k Dec 27, 2022
Joint Detection and Identification Feature Learning for Person Search

Person Search Project This repository hosts the code for our paper Joint Detection and Identification Feature Learning for Person Search. The code is

712 Dec 17, 2022
A library for hidden semi-Markov models with explicit durations

hsmmlearn hsmmlearn is a library for unsupervised learning of hidden semi-Markov models with explicit durations. It is a port of the hsmm package for

Joris Vankerschaver 69 Dec 20, 2022
A real-time approach for mapping all human pixels of 2D RGB images to a 3D surface-based model of the body

DensePose: Dense Human Pose Estimation In The Wild Rıza Alp Güler, Natalia Neverova, Iasonas Kokkinos [densepose.org] [arXiv] [BibTeX] Dense human pos

Meta Research 6.4k Jan 01, 2023
A PyTorch-Based Framework for Deep Learning in Computer Vision

TorchCV: A PyTorch-Based Framework for Deep Learning in Computer Vision @misc{you2019torchcv, author = {Ansheng You and Xiangtai Li and Zhen Zhu a

Donny You 2.2k Jan 09, 2023
IAUnet: Global Context-Aware Feature Learning for Person Re-Identification

IAUnet This repository contains the code for the paper: IAUnet: Global Context-Aware Feature Learning for Person Re-Identification Ruibing Hou, Bingpe

30 Jul 14, 2022
RepVGG: Making VGG-style ConvNets Great Again

RepVGG: Making VGG-style ConvNets Great Again (PyTorch) This is a super simple ConvNet architecture that achieves over 80% top-1 accuracy on ImageNet

2.8k Jan 04, 2023
CS_Final_Metal_surface_detection - This is a final project for CoderSchool Machine Learning bootcamp on 29/12/2021.

CS_Final_Metal_surface_detection This is a final project for CoderSchool Machine Learning bootcamp on 29/12/2021. The project is based on the dataset

Cuong Vo 1 Dec 29, 2021
Example scripts for the detection of lanes using the ultra fast lane detection model in Tensorflow Lite.

TFlite Ultra Fast Lane Detection Inference Example scripts for the detection of lanes using the ultra fast lane detection model in Tensorflow Lite. So

Ibai Gorordo 12 Aug 27, 2022
mPose3D, a mmWave-based 3D human pose estimation model.

mPose3D, a mmWave-based 3D human pose estimation model.

KylinChen 35 Nov 08, 2022
Code of U2Fusion: a unified unsupervised image fusion network for multiple image fusion tasks, including multi-modal, multi-exposure and multi-focus image fusion.

U2Fusion Code of U2Fusion: a unified unsupervised image fusion network for multiple image fusion tasks, including multi-modal (VIS-IR, medical), multi

Han Xu 129 Dec 11, 2022
A solution to the 2D Ising model of ferromagnetism, implemented using the Metropolis algorithm

Solving the Ising model on a 2D lattice using the Metropolis Algorithm Introduction The Ising model is a simplified model of ferromagnetism, the pheno

Rohit Prabhu 5 Nov 13, 2022
Code for: Gradient-based Hierarchical Clustering using Continuous Representations of Trees in Hyperbolic Space. Nicholas Monath, Manzil Zaheer, Daniel Silva, Andrew McCallum, Amr Ahmed. KDD 2019.

gHHC Code for: Gradient-based Hierarchical Clustering using Continuous Representations of Trees in Hyperbolic Space. Nicholas Monath, Manzil Zaheer, D

Nicholas Monath 35 Nov 16, 2022
OCR-D wrapper for detectron2 based segmentation models

ocrd_detectron2 OCR-D wrapper for detectron2 based segmentation models Introduction Installation Usage OCR-D processor interface ocrd-detectron2-segm

Robert Sachunsky 13 Dec 06, 2022
Performance Analysis of Multi-user NOMA Wireless-Powered mMTC Networks: A Stochastic Geometry Approach

Performance Analysis of Multi-user NOMA Wireless-Powered mMTC Networks: A Stochastic Geometry Approach Thanh Luan Nguyen, Tri Nhu Do, Georges Kaddoum

Thanh Luan Nguyen 2 Oct 10, 2022
PGPortfolio: Policy Gradient Portfolio, the source code of "A Deep Reinforcement Learning Framework for the Financial Portfolio Management Problem"(https://arxiv.org/pdf/1706.10059.pdf).

This is the original implementation of our paper, A Deep Reinforcement Learning Framework for the Financial Portfolio Management Problem (arXiv:1706.1

Zhengyao Jiang 1.5k Dec 29, 2022
Real-time LIDAR-based Urban Road and Sidewalk detection for Autonomous Vehicles 🚗

urban_road_filter: a real-time LIDAR-based urban road and sidewalk detection algorithm for autonomous vehicles Dependency ROS (tested with Kinetic and

JKK - Vehicle Industry Research Center 180 Dec 12, 2022
Dense Contrastive Learning (DenseCL) for self-supervised representation learning, CVPR 2021.

Dense Contrastive Learning for Self-Supervised Visual Pre-Training This project hosts the code for implementing the DenseCL algorithm for se

Xinlong Wang 491 Jan 03, 2023
To prepare an image processing model to classify the type of disaster based on the image dataset

Disaster Classificiation using CNNs bunnysaini/Disaster-Classificiation Goal To prepare an image processing model to classify the type of disaster bas

Bunny Saini 1 Jan 24, 2022