Molecular Sets (MOSES): A benchmarking platform for molecular generation models

Overview

Molecular Sets (MOSES): A benchmarking platform for molecular generation models

Build Status PyPI version

Deep generative models are rapidly becoming popular for the discovery of new molecules and materials. Such models learn on a large collection of molecular structures and produce novel compounds. In this work, we introduce Molecular Sets (MOSES), a benchmarking platform to support research on machine learning for drug discovery. MOSES implements several popular molecular generation models and provides a set of metrics to evaluate the quality and diversity of generated molecules. With MOSES, we aim to standardize the research on molecular generation and facilitate the sharing and comparison of new models.

For more details, please refer to the paper.

If you are using MOSES in your research paper, please cite us as

@article{10.3389/fphar.2020.565644,
  title={{M}olecular {S}ets ({MOSES}): {A} {B}enchmarking {P}latform for {M}olecular {G}eneration {M}odels},
  author={Polykovskiy, Daniil and Zhebrak, Alexander and Sanchez-Lengeling, Benjamin and Golovanov, Sergey and Tatanov, Oktai and Belyaev, Stanislav and Kurbanov, Rauf and Artamonov, Aleksey and Aladinskiy, Vladimir and Veselov, Mark and Kadurin, Artur and Johansson, Simon and  Chen, Hongming and Nikolenko, Sergey and Aspuru-Guzik, Alan and Zhavoronkov, Alex},
  journal={Frontiers in Pharmacology},
  year={2020}
}

pipeline

Dataset

We propose a benchmarking dataset refined from the ZINC database.

The set is based on the ZINC Clean Leads collection. It contains 4,591,276 molecules in total, filtered by molecular weight in the range from 250 to 350 Daltons, a number of rotatable bonds not greater than 7, and XlogP less than or equal to 3.5. We removed molecules containing charged atoms or atoms besides C, N, S, O, F, Cl, Br, H or cycles longer than 8 atoms. The molecules were filtered via medicinal chemistry filters (MCFs) and PAINS filters.

The dataset contains 1,936,962 molecular structures. For experiments, we split the dataset into a training, test and scaffold test sets containing around 1.6M, 176k, and 176k molecules respectively. The scaffold test set contains unique Bemis-Murcko scaffolds that were not present in the training and test sets. We use this set to assess how well the model can generate previously unobserved scaffolds.

Models

Metrics

Besides standard uniqueness and validity metrics, MOSES provides other metrics to access the overall quality of generated molecules. Fragment similarity (Frag) and Scaffold similarity (Scaff) are cosine distances between vectors of fragment or scaffold frequencies correspondingly of the generated and test sets. Nearest neighbor similarity (SNN) is the average similarity of generated molecules to the nearest molecule from the test set. Internal diversity (IntDiv) is an average pairwise similarity of generated molecules. Fréchet ChemNet Distance (FCD) measures the difference in distributions of last layer activations of ChemNet. Novelty is a fraction of unique valid generated molecules not present in the training set.

Model Valid (↑) [email protected] (↑) [email protected] (↑) FCD (↓) SNN (↑) Frag (↑) Scaf (↑) IntDiv (↑) IntDiv2 (↑) Filters (↑) Novelty (↑)
Test TestSF Test TestSF Test TestSF Test TestSF
Train 1.0 1.0 1.0 0.008 0.4755 0.6419 0.5859 1.0 0.9986 0.9907 0.0 0.8567 0.8508 1.0 1.0
HMM 0.076±0.0322 0.623±0.1224 0.5671±0.1424 24.4661±2.5251 25.4312±2.5599 0.3876±0.0107 0.3795±0.0107 0.5754±0.1224 0.5681±0.1218 0.2065±0.0481 0.049±0.018 0.8466±0.0403 0.8104±0.0507 0.9024±0.0489 0.9994±0.001
NGram 0.2376±0.0025 0.974±0.0108 0.9217±0.0019 5.5069±0.1027 6.2306±0.0966 0.5209±0.001 0.4997±0.0005 0.9846±0.0012 0.9815±0.0012 0.5302±0.0163 0.0977±0.0142 0.8738±0.0002 0.8644±0.0002 0.9582±0.001 0.9694±0.001
Combinatorial 1.0±0.0 0.9983±0.0015 0.9909±0.0009 4.2375±0.037 4.5113±0.0274 0.4514±0.0003 0.4388±0.0002 0.9912±0.0004 0.9904±0.0003 0.4445±0.0056 0.0865±0.0027 0.8732±0.0002 0.8666±0.0002 0.9557±0.0018 0.9878±0.0008
CharRNN 0.9748±0.0264 1.0±0.0 0.9994±0.0003 0.0732±0.0247 0.5204±0.0379 0.6015±0.0206 0.5649±0.0142 0.9998±0.0002 0.9983±0.0003 0.9242±0.0058 0.1101±0.0081 0.8562±0.0005 0.8503±0.0005 0.9943±0.0034 0.8419±0.0509
AAE 0.9368±0.0341 1.0±0.0 0.9973±0.002 0.5555±0.2033 1.0572±0.2375 0.6081±0.0043 0.5677±0.0045 0.991±0.0051 0.9905±0.0039 0.9022±0.0375 0.0789±0.009 0.8557±0.0031 0.8499±0.003 0.996±0.0006 0.7931±0.0285
VAE 0.9767±0.0012 1.0±0.0 0.9984±0.0005 0.099±0.0125 0.567±0.0338 0.6257±0.0005 0.5783±0.0008 0.9994±0.0001 0.9984±0.0003 0.9386±0.0021 0.0588±0.0095 0.8558±0.0004 0.8498±0.0004 0.997±0.0002 0.6949±0.0069
JTN-VAE 1.0±0.0 1.0±0.0 0.9996±0.0003 0.3954±0.0234 0.9382±0.0531 0.5477±0.0076 0.5194±0.007 0.9965±0.0003 0.9947±0.0002 0.8964±0.0039 0.1009±0.0105 0.8551±0.0034 0.8493±0.0035 0.976±0.0016 0.9143±0.0058
LatentGAN 0.8966±0.0029 1.0±0.0 0.9968±0.0002 0.2968±0.0087 0.8281±0.0117 0.5371±0.0004 0.5132±0.0002 0.9986±0.0004 0.9972±0.0007 0.8867±0.0009 0.1072±0.0098 0.8565±0.0007 0.8505±0.0006 0.9735±0.0006 0.9498±0.0006

For comparison of molecular properties, we computed the Wasserstein-1 distance between distributions of molecules in the generated and test sets. Below, we provide plots for lipophilicity (logP), Synthetic Accessibility (SA), Quantitative Estimation of Drug-likeness (QED) and molecular weight.

logP SA
logP SA
weight QED
weight QED

Installation

PyPi

The simplest way to install MOSES (models and metrics) is to install RDKit: conda install -yq -c rdkit rdkit and then install MOSES (molsets) from pip (pip install molsets). If you want to use LatentGAN, you should also install additional dependencies using bash install_latentgan_dependencies.sh.

If you are using Ubuntu, you should also install sudo apt-get install libxrender1 libxext6 for RDKit.

Docker

  1. Install docker and nvidia-docker.

  2. Pull an existing image (4.1Gb to download) from DockerHub:

docker pull molecularsets/moses

or clone the repository and build it manually:

git clone https://github.com/molecularsets/moses.git
nvidia-docker image build --tag molecularsets/moses moses/
  1. Create a container:
nvidia-docker run -it --name moses --network="host" --shm-size 10G molecularsets/moses
  1. The dataset and source code are available inside the docker container at /moses:
docker exec -it molecularsets/moses bash

Manually

Alternatively, install dependencies and MOSES manually.

  1. Clone the repository:
git lfs install
git clone https://github.com/molecularsets/moses.git
  1. Install RDKit for metrics calculation.

  2. Install MOSES:

python setup.py install
  1. (Optional) Install dependencies for LatentGAN:
bash install_latentgan_dependencies.sh

Benchmarking your models

  • Install MOSES as described in the previous section.

  • Get train, test and test_scaffolds datasets using the following code:

import moses

train = moses.get_dataset('train')
test = moses.get_dataset('test')
test_scaffolds = moses.get_dataset('test_scaffolds')
  • You can use a standard torch DataLoader in your models. We provide a simple StringDataset class for convenience:
from torch.utils.data import DataLoader
from moses import CharVocab, StringDataset

train = moses.get_dataset('train')
vocab = CharVocab.from_data(train)
train_dataset = StringDataset(vocab, train)
train_dataloader = DataLoader(
    train_dataset, batch_size=512,
    shuffle=True, collate_fn=train_dataset.default_collate
)

for with_bos, with_eos, lengths in train_dataloader:
    ...
  • Calculate metrics from your model's samples. We recomend sampling at least 30,000 molecules:
import moses
metrics = moses.get_all_metrics(list_of_generated_smiles)
  • Add generated samples and metrics to your repository. Run the experiment multiple times to estimate the variance of the metrics.

Reproducing the baselines

End-to-End launch

You can run pretty much everything with:

python scripts/run.py

This will split the dataset, train the models, generate new molecules, and calculate the metrics. Evaluation results will be saved in metrics.csv.

You can specify the GPU device index as cuda:n (or cpu for CPU) and/or model by running:

python scripts/run.py --device cuda:1 --model aae

For more details run python scripts/run.py --help.

You can reproduce evaluation of all models with several seeds by running:

sh scripts/run_all_models.sh

Training

python scripts/train.py <model name> \
       --train_load <train dataset> \
       --model_save <path to model> \
       --config_save <path to config> \
       --vocab_save <path to vocabulary>

To get a list of supported models run python scripts/train.py --help.

For more details of certain model run python scripts/train.py --help .

Generation

python scripts/sample.py <model name> \
       --model_load <path to model> \
       --vocab_load <path to vocabulary> \
       --config_load <path to config> \
       --n_samples <number of samples> \
       --gen_save <path to generated dataset>

To get a list of supported models run python scripts/sample.py --help.

For more details of certain model run python scripts/sample.py --help .

Evaluation

python scripts/eval.py \
       --ref_path <reference dataset> \
       --gen_path <generated dataset>

For more details run python scripts/eval.py --help.

Owner
Neelesh C A
Neelesh C A
Implementation of a protein autoregressive language model, but with autoregressive infilling objective (editing subsequences capability)

Protein GLM (wip) Implementation of a protein autoregressive language model, but with autoregressive infilling objective (editing subsequences capabil

Phil Wang 17 May 06, 2022
A python3 tool to take a 360 degree survey of the RF spectrum (hamlib + rotctld + RTL-SDR/HackRF)

RF Light House (rflh) A python script to use a rotor and a SDR device (RTL-SDR or HackRF One) to measure the RF level around and get a data set and be

Pavel Milanes (CO7WT) 11 Dec 13, 2022
The implementation of 'Image synthesis via semantic composition'.

Image synthesis via semantic synthesis [Project Page] by Yi Wang, Lu Qi, Ying-Cong Chen, Xiangyu Zhang, Jiaya Jia. Introduction This repository gives

DV Lab 71 Jan 06, 2023
CLOOB: Modern Hopfield Networks with InfoLOOB Outperform CLIP

CLOOB: Modern Hopfield Networks with InfoLOOB Outperform CLIP Andreas Fürst* 1, Elisabeth Rumetshofer* 1, Viet Tran1, Hubert Ramsauer1, Fei Tang3, Joh

Institute for Machine Learning, Johannes Kepler University Linz 133 Jan 04, 2023
The official codes for the ICCV2021 Oral presentation "Rethinking Counting and Localization in Crowds: A Purely Point-Based Framework"

P2PNet (ICCV2021 Oral Presentation) This repository contains codes for the official implementation in PyTorch of P2PNet as described in Rethinking Cou

Tencent YouTu Research 208 Dec 26, 2022
Python version of the amazing Reaction Mechanism Generator (RMG).

Reaction Mechanism Generator (RMG) Description This repository contains the Python version of Reaction Mechanism Generator (RMG), a tool for automatic

Reaction Mechanism Generator 284 Dec 27, 2022
MAU: A Motion-Aware Unit for Video Prediction and Beyond, NeurIPS2021

MAU (NeurIPS2021) Zheng Chang, Xinfeng Zhang, Shanshe Wang, Siwei Ma, Yan Ye, Xinguang Xiang, Wen GAo. Official PyTorch Code for "MAU: A Motion-Aware

ZhengChang 20 Nov 25, 2022
LoFTR:Detector-Free Local Feature Matching with Transformers CVPR 2021

LoFTR-with-train-script LoFTR:Detector-Free Local Feature Matching with Transformers CVPR 2021 (with train script --- unofficial ---). About Megadepth

Nan Xiaohu 15 Nov 04, 2022
Official implementation of NPMs: Neural Parametric Models for 3D Deformable Shapes - ICCV 2021

NPMs: Neural Parametric Models Project Page | Paper | ArXiv | Video NPMs: Neural Parametric Models for 3D Deformable Shapes Pablo Palafox, Aljaz Bozic

PabloPalafox 109 Nov 22, 2022
HuSpaCy: industrial-strength Hungarian natural language processing

HuSpaCy: Industrial-strength Hungarian NLP HuSpaCy is a spaCy model and a library providing industrial-strength Hungarian language processing faciliti

HuSpaCy 120 Dec 14, 2022
Object Detection Projekt in GKI WS2021/22

tfObjectDetection Object Detection Projekt with tensorflow in GKI WS2021/22 Docker Container: docker run -it --name --gpus all -v path/to/project:p

Tim Eggers 1 Jul 18, 2022
CCCL: Contrastive Cascade Graph Learning.

CCGL: Contrastive Cascade Graph Learning This repo provides a reference implementation of Contrastive Cascade Graph Learning (CCGL) framework as descr

Xovee Xu 19 Dec 05, 2022
Training Certifiably Robust Neural Networks with Efficient Local Lipschitz Bounds (Local-Lip)

Training Certifiably Robust Neural Networks with Efficient Local Lipschitz Bounds (Local-Lip) Introduction TL;DR: We propose an efficient and trainabl

17 Dec 01, 2022
(EI 2022) Controllable Confidence-Based Image Denoising

Image Denoising with Control over Deep Network Hallucination Paper and arXiv preprint -- Our frequency-domain insights derive from SFM and the concept

Images and Visual Representation Laboratory (IVRL) at EPFL 5 Dec 18, 2022
TCTrack: Temporal Contexts for Aerial Tracking (CVPR2022)

TCTrack: Temporal Contexts for Aerial Tracking (CVPR2022) Ziang Cao and Ziyuan Huang and Liang Pan and Shiwei Zhang and Ziwei Liu and Changhong Fu In

Intelligent Vision for Robotics in Complex Environment 100 Dec 19, 2022
PyTorch code for Composing Partial Differential Equations with Physics-Aware Neural Networks

FInite volume Neural Network (FINN) This repository contains the PyTorch code for models, training, and testing, and Python code for data generation t

Cognitive Modeling 20 Dec 18, 2022
Fully Connected DenseNet for Image Segmentation

Fully Connected DenseNets for Semantic Segmentation Fully Connected DenseNet for Image Segmentation implementation of the paper The One Hundred Layers

Somshubra Majumdar 84 Oct 31, 2022
Precomputed Real-Time Texture Synthesis with Markovian Generative Adversarial Networks

MGANs Training & Testing code (torch), pre-trained models and supplementary materials for "Precomputed Real-Time Texture Synthesis with Markovian Gene

290 Nov 15, 2022
Pytorch implementation of PCT: Point Cloud Transformer

PCT: Point Cloud Transformer This is a Pytorch implementation of PCT: Point Cloud Transformer.

Yi_Zhang 265 Dec 22, 2022
Implementation of the paper All Labels Are Not Created Equal: Enhancing Semi-supervision via Label Grouping and Co-training

SemCo The official pytorch implementation of the paper All Labels Are Not Created Equal: Enhancing Semi-supervision via Label Grouping and Co-training

42 Nov 14, 2022