Improving Compound Activity Classification via Deep Transfer and Representation Learning

Overview

Improving Compound Activity Classification via Deep Transfer and Representation Learning

This repository is the official implementation of Improving Compound Activity Classification via Deep Transfer and Representation Learning.

Requirements

Operating systems: Red Hat Enterprise Linux Server 7.9

To install requirements:

pip install -r requirements.txt

Installation guide

Download the code and dataset with the command:

git clone https://github.com/ninglab/TransferAct.git

Data Processing

1. Use provided processed dataset

One can use our provided processed dataset in ./data/pairs/: the dataset of pairs of processed balanced assays $\mathcal{P}$ . Check the details of bioassay selection, processing, and assay pair selection in our paper in Section 5.1.1 and Section 5.1.2, respectively. We provided our dataset of pairs as data/pairs.tar.gz compressed file. Please use tar to de-compress it.

2. Use own dataset

We provide necessary scripts in ./data/scripts/ with the processing steps in ./data/scripts/README.md.

Training

1. Running TAc

  • To run TAc-dmpn,
python code/train_aada.py --source_data_path <source_assay_csv_file> --target_data_path <target_assay_csv_file> --dataset_type classification --extra_metrics prc-auc precision recall accuracy f1_score --hidden_size 25 --depth 4 --init_lr 1e-3 --batch_size 10 --ffn_hidden_size 100 --ffn_num_layers 2 --epochs 40 --alpha 1 --lamda 0 --split_type index_predetermined --crossval_index_file <index_file> --save_dir <chkpt_dir> --class_balance --mpn_shared
  • To run TAc-dmpna, add these arguments to the above command
--attn_dim 100 --aggregation self-attention --model aada_attention

source_data_path and target_data_path specify the path to the source and target assay CSV files of the pair, respectively. First line contains a header smiles,target. Each of the following lines are comma-separated with the SMILES in the 1st column and the 0/1 label in the 2nd column.

dataset_type specifies the type of task; always classification for this project.

extra_metrics specifies the list of evaluation metrics.

hidden_size specifies the dimension of the learned compound representation out of GNN-based feature generators.

depth specifies the number of message passing steps.

init_lr specifies the initial learning rate.

batch_size specifies the batch size.

ffn_hidden_size and ffn_num_layers specify the number of hidden units and layers, respectively, in the fully connected network used as the classifier.

epochs specifies the total number of epochs.

split_type specifies the type of data split.

crossval_index_file specifies the path to the index file which contains the indices of data points for train, validation and test split for each fold.

save_dir specifies the directory where the model, evaluation scores and predictions will be saved.

class_balance indicates whether to use class-balanced batches during training.

model specifies which model to use.

aggregation specifies which pooling mechanism to use to get the compound representation from the atom representations. Default set to mean: the atom-level representations from the message passing network are averaged over all atoms of a compound to yield the compound representation.

attn_dim specifies the dimension of the hidden layer in the 2-layer fully connected network used as the attention network.

Use python code/train_aada.py -h to check the meaning and default values of other parameters.

2. Running TAc-fc variants and ablations

  • To run Tac-fc,
python code/train_aada.py --source_data_path <source_assay_csv_file> --target_data_path <target_assay_csv_file> --dataset_type classification --extra_metrics prc-auc precision recall accuracy f1_score --hidden_size 25 --depth 4 --init_lr 1e-3 --batch_size 10 --ffn_hidden_size 100 --ffn_num_layers 2 --local_discriminator_hidden_size 100 --local_discriminator_num_layers 2 --global_discriminator_hidden_size 100 --global_discriminator_num_layers 2 --epochs 40 --alpha 1 --lamda 1 --split_type index_predetermined --crossval_index_file <index_file> --save_dir <chkpt_dir> --class_balance --mpn_shared
  • To run TAc-fc-dmpna, add these arguments to the above command
--attn_dim 100 --aggregation self-attention --model aada_attention
Ablations
  • To run TAc-f, add --exclude_global to the above command.
  • To run TAc-c, add --exclude_local to the above command.
  • Adding both --exclude_local and --exclude_global is equivalent to running TAc.

3. Running Baselines

DANN

python code/train_aada.py --source_data_path <source_assay_csv_file> --target_data_path <target_assay_csv_file> --dataset_type classification --extra_metrics prc-auc precision recall accuracy f1_score --hidden_size 25 --depth 4 --init_lr 1e-3 --batch_size 10 --ffn_hidden_size 100 --ffn_num_layers 2 --global_discriminator_hidden_size 100 --global_discriminator_num_layers 2 --epochs 40 --alpha 1 --lamda 1 --split_type index_predetermined --crossval_index_file <index_file> --save_dir <chkpt_dir> --class_balance --mpn_shared
  • To run DANN-dmpn, add --model dann to the above command.
  • To run DANN-dmpna, add --model dann_attention --attn_dim 100 --aggregation self-attention --model to the above command.

Run the following baselines from chemprop as follows:

FCN-morgan

python chemprop/train.py --data_path <assay_csv_file> --dataset_type classification --extra_metrics prc-auc precision recall accuracy f1_score --init_lr 1e-3 --batch_size 10 --ffn_hidden_size 100 --ffn_num_layers 2 --epochs 40 --features_generator morgan --features_only --split_type index_predetermined --crossval_index_file <index_file> --save_dir <chkpt_dir> --class_balance

FCN-morganc

python chemprop/train.py --data_path <assay_csv_file> --dataset_type classification --extra_metrics prc-auc precision recall accuracy f1_score --init_lr 1e-3 --batch_size 10 --ffn_hidden_size 100 --ffn_num_layers 2 --epochs 40 --features_generator morgan_count --features_only --split_type index_predetermined --crossval_index_file <index_file> --save_dir <chkpt_dir> --class_balance

FCN-dmpn

python chemprop/train.py --data_path <assay_csv_file> --dataset_type classification --extra_metrics prc-auc precision recall accuracy f1_score --hidden_size 25 --depth 4 --init_lr 1e-3 --batch_size 10 --ffn_hidden_size 100 --ffn_num_layers 2 --epochs 40 --split_type index_predetermined --crossval_index_file <index_file> --save_dir <chkpt_dir> --class_balance

FCN-dmpna

Add the following to the above command:

--model mpnn_attention --attn_dim 100 --aggregation self-attention

For the above baselines, data_path specifies the path to the target assay CSV file.

FCN-dmpn(DT)

python chemprop/train.py --data_path <source_assay_csv_file> --target_data_path <target_assay_csv_file> --dataset_type classification --extra_metrics prc-auc precision recall accuracy f1_score  --hidden_size 25 --depth 4 --init_lr 1e-3 --batch_size 10 --ffn_hidden_size 100 --ffn_num_layers 2 --epochs 40 --split_type index_predetermined --crossval_index_file <index_file> --save_dir <chkpt_dir> --class_balance

FCN-dmpna(DT)

--model mpnn_attention --attn_dim 100 --aggregation self-attention

For FCN-dmpn(DT)and FCN-dmpna(DT), data_path and target_data_path specify the path to the source and target assay CSV files.

Use python chemprop/train.py -h to check the meaning of other parameters.

Testing

  1. To predict the labels of the compounds in the test set for Tac*, DANN methods:

    python code/predict.py --test_path <test_csv_file> --checkpoint_dir <chkpt_dir> --preds_path <pred_file>

    test_path specifies the path to a CSV file containing a list of SMILES and ground-truth labels. First line contains a header smiles,target. Each of the following lines are comma-separated with the SMILES in the 1st column and the 0/1 label in the 2nd column.

    checkpoint_dir specifies the path to the checkpoint directory where the model checkpoint(s) .pt filles are saved (i.e., save_dir during training).

    preds_path specifies the path to a CSV file where the predictions will be saved.

  2. To predict the labels of the compounds in the test set for other methods:

    python chemprop/predict.py --test_path <test_csv_file> --checkpoint_dir <chkpt_dir> --preds_path <pred_file>
    

Compound Prioritization using dmpna:

Please refer to the README.md in the comprank directory.

Owner
NingLab
NingLab
Efficient Speech Processing Tookit for Automatic Speaker Recognition

Sugar Efficient Speech Processing Tookit for Automatic Speaker Recognition | HuggingFace | What's New EfficientTDNN: Efficient Architecture Search for

WangRui 14 Sep 14, 2022
This Repostory contains the pretrained DTLN-aec model for real-time acoustic echo cancellation.

This Repostory contains the pretrained DTLN-aec model for real-time acoustic echo cancellation.

Nils L. Westhausen 182 Jan 07, 2023
PyTorch Implement of Context Encoders: Feature Learning by Inpainting

Context Encoders: Feature Learning by Inpainting This is the Pytorch implement of CVPR 2016 paper on Context Encoders 1) Semantic Inpainting Demo Inst

321 Dec 25, 2022
Scripts and a shader to get you started on setting up an exported Koikatsu character in Blender.

KK Blender Shader Pack A plugin and a shader to get you started with setting up an exported Koikatsu character in Blender. The plugin is a Blender add

166 Jan 01, 2023
TF Image Segmentation: Image Segmentation framework

TF Image Segmentation: Image Segmentation framework The aim of the TF Image Segmentation framework is to provide/provide a simplified way for: Convert

Daniil Pakhomov 546 Dec 17, 2022
SCALE: Modeling Clothed Humans with a Surface Codec of Articulated Local Elements (CVPR 2021)

SCALE: Modeling Clothed Humans with a Surface Codec of Articulated Local Elements (CVPR 2021) This repository contains the official PyTorch implementa

Qianli Ma 133 Jan 05, 2023
PyTorch implementation for "Sharpness-aware Quantization for Deep Neural Networks".

Sharpness-aware Quantization for Deep Neural Networks Recent Update 2021.11.23: We release the source code of SAQ. Setup the environments Clone the re

Zhuang AI Group 30 Dec 19, 2022
9th place solution

AllDataAreExt-Galixir-Kaggle-HPA-2021-Solution Team Members Qishen Ha is Master of Engineering from the University of Tokyo. Machine Learning Engineer

daishu 5 Nov 18, 2021
Unrestricted Facial Geometry Reconstruction Using Image-to-Image Translation

Unrestricted Facial Geometry Reconstruction Using Image-to-Image Translation [Arxiv] [Video] Evaluation code for Unrestricted Facial Geometry Reconstr

Matan Sela 242 Dec 30, 2022
Official pytorch implementation of "Scaling-up Disentanglement for Image Translation", ICCV 2021.

Official pytorch implementation of "Scaling-up Disentanglement for Image Translation", ICCV 2021.

Aviv Gabbay 41 Nov 29, 2022
Implementation for ACProp ( Momentum centering and asynchronous update for adaptive gradient methdos, NeurIPS 2021)

This repository contains code to reproduce results for submission NeurIPS 2021, "Momentum Centering and Asynchronous Update for Adaptive Gradient Meth

Juntang Zhuang 15 Jun 11, 2022
PyTorch implementation of "A Two-Stage End-to-End System for Speech-in-Noise Hearing Aid Processing"

Implementation of the Sheffield entry for the first Clarity enhancement challenge (CEC1) This repository contains the PyTorch implementation of "A Two

10 Aug 19, 2022
An open-source, low-cost, image-based weed detection device for fallow scenarios.

Welcome to the OpenWeedLocator (OWL) project, an opensource hardware and software green-on-brown weed detector that uses entirely off-the-shelf compon

Guy Coleman 145 Jan 05, 2023
Proposed n-stage Latent Dirichlet Allocation method - A Novel Approach for LDA

n-stage Latent Dirichlet Allocation (n-LDA) Proposed n-LDA & A Novel Approach for classical LDA Latent Dirichlet Allocation (LDA) is a generative prob

Anıl Güven 4 Mar 07, 2022
Baseline for the Spoofing-aware Speaker Verification Challenge 2022

Introduction This repository contains several materials that supplements the Spoofing-Aware Speaker Verification (SASV) Challenge 2022 including: calc

40 Dec 28, 2022
TorchFlare is a simple, beginner-friendly, and easy-to-use PyTorch Framework train your models effortlessly.

TorchFlare TorchFlare is a simple, beginner-friendly and an easy-to-use PyTorch Framework train your models without much effort. It provides an almost

Atharva Phatak 85 Dec 26, 2022
PyTorch package for the discrete VAE used for DALL·E.

Overview [Blog] [Paper] [Model Card] [Usage] This is the official PyTorch package for the discrete VAE used for DALL·E. Installation Before running th

OpenAI 9.5k Jan 05, 2023
Weakly Supervised Scene Text Detection using Deep Reinforcement Learning

Weakly Supervised Scene Text Detection using Deep Reinforcement Learning This repository contains the setup for all experiments performed in our Paper

Emanuel Metzenthin 3 Dec 16, 2022
Realtime YOLO Monster Detection With Non Maximum Supression

Realtime-YOLO-Monster-Detection-With-Non-Maximum-Supression Table of Contents In

5 Oct 07, 2022
Source Code of NeurIPS21 paper: Recognizing Vector Graphics without Rasterization

YOLaT-VectorGraphicsRecognition This repository is the official PyTorch implementation of our NeurIPS-2021 paper: Recognizing Vector Graphics without

Microsoft 49 Dec 20, 2022