FS-Mol: A Few-Shot Learning Dataset of Molecules

This repository contains data and code for FS-Mol: A Few-Shot Learning Dataset of Molecules.

Installation

  1. Clone or download this repository

  2. Install dependencies

    cd FS-Mol
    
    conda env create -f environment.yml
    conda activate fsmol
    

The code for the Molecule Attention Transformer (MAT) baseline is included as a submodule of this repository. To run MAT, clone the repository via git clone --recurse-submodules, or clone it normally and then set up the submodule via git submodule update --init. If the MAT submodule is not set up, all other parts of the repository should continue to work.

Data

The dataset is available as a download, FS-Mol Data, split into train, valid and test folders. Additionally, the file datasets/fsmol-0.1.json specifies which tasks are to be used, giving a default list of tasks for each data fold. The complete dataset contains many more tasks. To train on all available training tasks, pass the training script argument --task_list_file datasets/entire_train_set.json. The task lists will be used to version FS-Mol in future iterations as more data becomes available via ChEMBL.
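As a rough illustration of how a task list file might be inspected, the sketch below loads a JSON file and counts the tasks listed for each fold. It assumes the file is a JSON object mapping fold names (for example train, valid and test) to lists of task names; this layout is an assumption and should be checked against datasets/fsmol-0.1.json. The helper names load_task_list and count_tasks_per_fold are hypothetical and not part of the repository.

```python
import json
from pathlib import Path
from typing import Dict, List


def load_task_list(path: str) -> Dict[str, List[str]]:
    """Load a task list file.

    Assumption: the file holds a JSON object mapping fold names to lists of task names.
    """
    with Path(path).open("r", encoding="utf-8") as handle:
        task_list = json.load(handle)
    if not isinstance(task_list, dict):
        raise ValueError("expected a JSON object mapping fold names to task lists")
    return {fold: list(tasks) for fold, tasks in task_list.items()}


def count_tasks_per_fold(task_list: Dict[str, List[str]]) -> Dict[str, int]:
    """Return the number of tasks listed for each fold."""
    return {fold: len(tasks) for fold, tasks in task_list.items()}
```

For example, count_tasks_per_fold(load_task_list("datasets/fsmol-0.1.json")) would report how many tasks each fold of the default list contains, provided the file follows the assumed layout.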

Tasks are stored as individual compressed JSONLines files, with each line containing the information for a single datapoint of the task. Each datapoint is stored as a JSON dictionary, following a fixed structure:

{
    "SMILES": "SMILES_STRING",
    "Property": "ACTIVITY BOOL LABEL",
    "Assay_ID": "CHEMBL ID",
    "RegressionProperty": "ACTIVITY VALUE",
    "LogRegressionProperty": "LOG ACTIVITY VALUE",
    "Relation": "ASSUMED RELATION OF MEASURED VALUE TO TRUE VALUE",
    "AssayType": "TYPE OF ASSAY",
    "fingerprints": [...],
    "descriptors": [...],
    "graph": {
        "adjacency_lists": [
           [... SINGLE BONDS AS PAIRS ...],
           [... DOUBLE BONDS AS PAIRS ...],
           [... TRIPLE BONDS AS PAIRS ...]
        ],
        "node_types": [...ATOM TYPES...],
        "node_features": [...NODE FEATURES...]
    }
}
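The structure above can be read with the Python standard library alone. The following is a minimal sketch, assuming the task files are gzip-compressed (the compression format is not stated here) and that each bond is stored as a pair of atom indices. The function names read_task_file and bonds_from_graph are illustrative and not part of the repository.

```python
import gzip
import json
from typing import Dict, Iterator, List, Tuple


def read_task_file(path: str) -> Iterator[Dict]:
    """Yield one datapoint dictionary per line of a compressed JSONLines file.

    Assumption: the file is gzip-compressed.
    """
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # skip blank lines
                yield json.loads(line)


# Bond order associated with each position in "adjacency_lists":
# index 0 holds single bonds, index 1 double bonds, index 2 triple bonds.
BOND_ORDERS = (1, 2, 3)


def bonds_from_graph(graph: Dict) -> List[Tuple[int, int, int]]:
    """Flatten the per-bond-type adjacency lists into (atom_a, atom_b, bond_order) triples."""
    bonds = []
    for bond_order, pairs in zip(BOND_ORDERS, graph["adjacency_lists"]):
        for atom_a, atom_b in pairs:
            bonds.append((atom_a, atom_b, bond_order))
    return bonds
```

Iterating over read_task_file(path) and passing each datapoint's "graph" entry to bonds_from_graph gives a single bond list per molecule, which can be easier to feed into graph libraries than three separate lists.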

FSMolDataset

The fs_mol.data.FSMolDataset class provides programmatic access in Python to the train/valid/test tasks of the few-shot dataset. An instance is created from the data directory by FSMolDataset.from_directory(/path/to/dataset). More details and examples of how to use FSMolDataset are available in fs_mol/notebooks/dataset.ipynb.

Evaluating a new Model

We provide an implementation of the FS-Mol evaluation methodology in fs_mol.utils.eval_utils.eval_model(). This is a framework-agnostic Python function, and we demonstrate in detail how to use it to evaluate a new model in notebooks/evaluation.ipynb.

Note that our baseline test scripts (fs_mol/baseline_test.py, fs_mol/maml_test.py, fs_mol/mat_test.py, fs_mol/multitask_test.py and fs_mol/protonet_test.py) use this function as well and can serve as examples of how to integrate per-task fine-tuning in TensorFlow (maml_test.py), fine-tuning in PyTorch (mat_test.py) and single-task training for scikit-learn models (baseline_test.py). These scripts also support the --task_list_file parameter to choose different sets of test tasks, as required.

Baseline Model Implementations

We provide implementations of three key few-shot learning methods: Multitask learning, Model-Agnostic Meta-Learning, and Prototypical Networks, as well as evaluation of the single-task baselines and the Molecule Attention Transformer (MAT).

All results and associated plots are found in the baselines/ directory.

These baseline methods can be run on the FS-Mol dataset as follows:

kNNs and Random Forests -- Single Task Baselines

Our kNN and RF baselines are obtained by permitting grid-search over an industry-standard parameter set, detailed in the script baseline_test.py.

The baseline single-task evaluation can be run as follows, with a choice of kNN or randomForest model:

python fs_mol/baseline_test.py /path/to/data --model {kNN, randomForest}

Molecule Attention Transformer

The Molecule Attention Transformer (MAT) can be evaluated as:

python fs_mol/mat_test.py /path/to/pretrained-mat /path/to/data

GNN-MAML pre-training and evaluation

The GNN-MAML model is an 8-layer GNN with node-embedding dimension 128, operating on the molecular graph representations of the dataset. The GNN uses "Edge-MLP" message passing. The model was trained with a support set size of 16 according to the MAML procedure (Finn et al., 2017). The hyperparameters used in the model checkpoint are the default settings of maml_train.py.

The current defaults were used to train the final version of GNN-MAML listed under Available Model Checkpoints. Training is run as:

python fs_mol/maml_train.py /path/to/data 

Evaluation is run as:

python fs_mol/maml_test.py /path/to/data --trained_model /path/to/gnn-maml-checkpoint
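For readers unfamiliar with the MAML procedure referenced in this section, the toy sketch below illustrates its inner (per-task adaptation) and outer (meta) updates on a one-parameter regression model. It is not the repository's GNN-MAML implementation: the model, the task distribution, the learning rates and all function names are illustrative assumptions, and only the support set size of 16 is taken from the text.

```python
import random
from typing import List, Tuple

Batch = List[Tuple[float, float]]  # (x, y) pairs


def mse_grad(w: float, batch: Batch) -> float:
    """Gradient of the mean squared error of the model y_hat = w * x with respect to w."""
    return sum(2.0 * x * (w * x - y) for x, y in batch) / len(batch)


def mse_curvature(batch: Batch) -> float:
    """Second derivative of the same loss with respect to w (independent of w)."""
    return sum(2.0 * x * x for x, _ in batch) / len(batch)


def maml_meta_gradient(w: float, support: Batch, query: Batch, inner_lr: float) -> float:
    """Exact MAML gradient for one task with a single inner gradient step."""
    # Inner loop: adapt the shared initialisation on the support set.
    w_adapted = w - inner_lr * mse_grad(w, support)
    # Outer loop: differentiate the query loss through the inner step (chain rule).
    d_adapted_d_w = 1.0 - inner_lr * mse_curvature(support)
    return mse_grad(w_adapted, query) * d_adapted_d_w


def sample_task(rng: random.Random, support_size: int = 16, query_size: int = 16) -> Tuple[Batch, Batch]:
    """Toy task (assumption): y = slope * x with a task-specific slope drawn from [1, 3]."""
    slope = rng.uniform(1.0, 3.0)

    def draw(n: int) -> Batch:
        xs = [rng.uniform(-1.0, 1.0) for _ in range(n)]
        return [(x, slope * x) for x in xs]

    return draw(support_size), draw(query_size)


def meta_train(steps: int = 500, inner_lr: float = 0.1, meta_lr: float = 0.1, seed: int = 0) -> float:
    """Meta-learn the initial weight by stochastic gradient descent over sampled tasks."""
    rng = random.Random(seed)
    w = 0.0  # shared initialisation to be meta-learned
    for _ in range(steps):
        support, query = sample_task(rng)
        w -= meta_lr * maml_meta_gradient(w, support, query, inner_lr)
    return w
```

In this toy setting, meta_train() moves the shared initialisation towards the middle of the task slope range, so that a single inner step on a new task's support set starts from a sensible point. At test time only the inner loop is run on each new task.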

GNN-MT pre-training and evaluation

The GNN-MT model is a 10-layer GNN with node-embedding dimension 128, operating on the molecular graph representations of the dataset. The model uses principal neighbourhood aggregation (PNA) message passing. The hyperparameters used in the model checkpoint are the default settings of multitask_train.py. This method is similar to the task-only training approach described in Hu et al. (2019). Training is run as:

python fs_mol/multitask_train.py /path/to/data 

Evaluation is run as:

python fs_mol/multitask_test.py /path/to/gnn-mt-checkpoint /path/to/data

Prototypical Networks (PN) pre-training and evaluation

The prototypical networks method (Snell et al., 2017) extracts representations of support set datapoints and uses these to classify positive and negative examples. We use the Mahalanobis distance as the metric for the distance of a query point to the class prototypes.

Training is run as:

python fs_mol/protonet_train.py /path/to/data 

Evaluation is run as:

python fs_mol/protonet_test.py /path/to/pn-checkpoint /path/to/data
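The prototype-based classification rule described in this section can be sketched in a few lines. This is an illustrative simplification, not the repository's implementation: it works directly on given feature vectors rather than learned embeddings, and it uses a regularised diagonal covariance per class so that the Mahalanobis distance needs no matrix inversion (the general form uses a full covariance matrix). All names and the regularisation constant eps are assumptions.

```python
from typing import Dict, List, Sequence, Tuple

Vector = Sequence[float]


def class_statistics(vectors: List[Vector], eps: float = 1e-3) -> Tuple[List[float], List[float]]:
    """Return the prototype (mean) and regularised per-dimension variance of one class."""
    if not vectors:
        raise ValueError("each class needs at least one support example")
    n, dim = len(vectors), len(vectors[0])
    mean = [sum(v[i] for v in vectors) / n for i in range(dim)]
    # eps keeps the variance strictly positive, which matters for tiny support sets.
    var = [sum((v[i] - mean[i]) ** 2 for v in vectors) / n + eps for i in range(dim)]
    return mean, var


def mahalanobis_sq(x: Vector, mean: Vector, var: Vector) -> float:
    """Squared Mahalanobis distance for a diagonal covariance matrix."""
    return sum((xi - mi) ** 2 / vi for xi, mi, vi in zip(x, mean, var))


def classify(query: Vector, support: List[Tuple[Vector, bool]]) -> bool:
    """Assign the query to the class (active or inactive) with the closest prototype."""
    distances: Dict[bool, float] = {}
    for label in (True, False):
        members = [x for x, y in support if y == label]
        mean, var = class_statistics(members)
        distances[label] = mahalanobis_sq(query, mean, var)
    return min(distances, key=distances.get)
```

Scaling each dimension by the class variance means that a query is judged relative to how spread out each class is, which is the main difference from the plain Euclidean distance.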

Available Model Checkpoints

We provide pre-trained models for GNN-MAML, GNN-MT and PN; these can be downloaded from figshare.

| Model Name | Description | Checkpoint File |
|------------|-------------|-----------------|
| GNN-MAML | Support set size 16. 8-layer GNN. Edge-MLP message passing. | MAML-Support16_best_validation.pkl |
| GNN-MT | 10-layer GNN. PNA message passing. | multitask_best_model.pt |
| PN | 10-layer GNN. PNA message passing. ECFP+GNN, Mahalanobis distance metric. | PN-Support64_best_validation.pt |

Specifying, Training and Evaluating New Model Implementations

Few-shot and single-task models can be defined flexibly, as demonstrated by the range of train and test scripts in fs_mol.

We give a detailed example of how to use the abstract class AbstractTorchFSMolModel in notebooks/integrating_torch_models.ipynb to integrate a new general PyTorch model, and note that the evaluation procedure described above is demonstrated on scikit-learn models in fs_mol/baseline_test.py and on a TensorFlow-based GNN model in fs_mol/maml_test.py.

Contributing

This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.

When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.

This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact opencode@microsoft.com with any additional questions or comments.

Trademarks

This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow Microsoft's Trademark & Brand Guidelines. Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. Any use of third-party trademarks or logos are subject to those third-party's policies.
