Improving Compound Activity Classification via Deep Transfer and Representation Learning

Overview

This repository is the official implementation of Improving Compound Activity Classification via Deep Transfer and Representation Learning.

Requirements

Operating system: Red Hat Enterprise Linux Server 7.9

To install requirements:

pip install -r requirements.txt

Installation guide

Download the code and dataset with the command:

git clone https://github.com/ninglab/TransferAct.git

Data Processing

1. Use the provided processed dataset

The processed dataset of pairs of balanced assays, $\mathcal{P}$, is provided in ./data/pairs/. Details of bioassay selection and processing, and of assay pair selection, are in Sections 5.1.1 and 5.1.2 of our paper, respectively. The pairs are distributed as the compressed file data/pairs.tar.gz; use tar to decompress it.
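For anyone who prefers to unpack the archive from Python rather than with the tar command, here is a minimal sketch using the standard-library tarfile module. The function name extract_pairs and its default paths are illustrative assumptions, and the directory layout inside the archive is not specified in this README.

```python
import tarfile
from pathlib import Path


def extract_pairs(archive="data/pairs.tar.gz", dest="data"):
    """Unpack a gzip-compressed tar archive into dest and return its member names."""
    dest_dir = Path(dest)
    dest_dir.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive, "r:gz") as tar:
        names = tar.getnames()
        tar.extractall(dest_dir)
    return names
```

Calling extract_pairs() from the repository root would unpack data/pairs.tar.gz into data/, assuming the default paths match your checkout.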

2. Use your own dataset

We provide the necessary scripts in ./data/scripts/; the processing steps are described in ./data/scripts/README.md.

Training

1. Running TAc

  • To run TAc-dmpn,
python code/train_aada.py --source_data_path <source_assay_csv_file> --target_data_path <target_assay_csv_file> --dataset_type classification --extra_metrics prc-auc precision recall accuracy f1_score --hidden_size 25 --depth 4 --init_lr 1e-3 --batch_size 10 --ffn_hidden_size 100 --ffn_num_layers 2 --epochs 40 --alpha 1 --lamda 0 --split_type index_predetermined --crossval_index_file <index_file> --save_dir <chkpt_dir> --class_balance --mpn_shared
  • To run TAc-dmpna, add these arguments to the above command
--attn_dim 100 --aggregation self-attention --model aada_attention

source_data_path and target_data_path specify the paths to the source and target assay CSV files of the pair, respectively. The first line of each file is the header smiles,target. Each following line is comma-separated, with the SMILES string in the first column and the 0/1 label in the second.
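The CSV layout described above can be produced and sanity-checked with a few lines of standard-library Python. This is only a sketch: write_assay_csv and check_assay_csv are hypothetical helper names, not part of this repository.

```python
import csv


def write_assay_csv(path, records):
    """Write (smiles, label) pairs as a CSV with the header smiles,target."""
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["smiles", "target"])
        for smiles, label in records:
            if label not in (0, 1):
                raise ValueError(f"label for {smiles!r} must be 0 or 1, got {label!r}")
            writer.writerow([smiles, label])


def check_assay_csv(path):
    """Return the number of data rows, raising ValueError if the layout is off."""
    with open(path, newline="") as handle:
        reader = csv.reader(handle)
        header = next(reader, None)
        if header != ["smiles", "target"]:
            raise ValueError(f"unexpected header: {header!r}")
        count = 0
        for row in reader:
            if len(row) != 2 or row[1] not in ("0", "1"):
                raise ValueError(f"bad row: {row!r}")
            count += 1
    return count
```

Running check_assay_csv on a source or target file before training catches a missing header or a non-binary label early.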

dataset_type specifies the type of task; always classification for this project.

extra_metrics specifies the list of evaluation metrics.

hidden_size specifies the dimension of the learned compound representation out of GNN-based feature generators.

depth specifies the number of message passing steps.

init_lr specifies the initial learning rate.

batch_size specifies the batch size.

ffn_hidden_size and ffn_num_layers specify the number of hidden units and layers, respectively, in the fully connected network used as the classifier.

epochs specifies the total number of epochs.

split_type specifies the type of data split.

crossval_index_file specifies the path to the index file which contains the indices of data points for train, validation and test split for each fold.
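This README does not state the on-disk format of the index file. The sketch below assumes the upstream chemprop convention: a pickled list with one [train, val, test] triple of index lists per fold. Treat that format as an assumption and verify it against ./data/scripts/README.md; the function names are hypothetical.

```python
import pickle
import random


def make_fold_indices(num_points, num_folds=5, seed=0):
    """Build one [train, val, test] index triple per fold from a shuffled range."""
    if num_folds < 3:
        raise ValueError("need at least 3 folds to form train, val and test")
    indices = list(range(num_points))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into num_folds roughly equal chunks.
    chunks = [indices[i::num_folds] for i in range(num_folds)]
    folds = []
    for k in range(num_folds):
        val_k = (k + 1) % num_folds
        test = chunks[k]
        val = chunks[val_k]
        train = [i for j, chunk in enumerate(chunks)
                 if j not in (k, val_k) for i in chunk]
        folds.append([sorted(train), sorted(val), sorted(test)])
    return folds


def save_fold_indices(path, folds):
    """Pickle the list of folds (assumed file format, see the note above)."""
    with open(path, "wb") as handle:
        pickle.dump(folds, handle)
```

Each data point appears in exactly one of the three lists within a fold, and every point serves as a test point in exactly one fold.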

save_dir specifies the directory where the model, evaluation scores and predictions will be saved.

class_balance indicates whether to use class-balanced batches during training.

model specifies which model to use.

aggregation specifies the pooling mechanism used to obtain the compound representation from the atom representations. The default is mean: the atom-level representations from the message passing network are averaged over all atoms of a compound to yield the compound representation.
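As a toy illustration of mean aggregation, the sketch below averages per-atom vectors element-wise using plain Python lists. It is not the repository's implementation, and mean_pool is a hypothetical name.

```python
def mean_pool(atom_vectors):
    """Average per-atom vectors element-wise to get one compound vector."""
    if not atom_vectors:
        raise ValueError("a compound needs at least one atom")
    num_atoms = len(atom_vectors)
    dim = len(atom_vectors[0])
    return [sum(vec[d] for vec in atom_vectors) / num_atoms for d in range(dim)]
```

For a compound with two atoms represented as [1.0, 2.0] and [3.0, 4.0], the pooled representation is [2.0, 3.0].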

attn_dim specifies the dimension of the hidden layer in the 2-layer fully connected network used as the attention network.
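One common way to realize such an attention network is sketched below: a first layer maps each atom vector to attn_dim hidden units, a second layer maps those to a scalar score, and a softmax over atoms turns the scores into pooling weights. The tanh activation, the absence of bias terms, and the name attention_pool are assumptions made for illustration; the actual architecture is defined in the repository code.

```python
import math


def attention_pool(atom_vectors, w1, w2):
    """Weighted sum of atom vectors with weights from a 2-layer scoring network.

    w1: attn_dim rows, each of length hidden_size (first layer weights)
    w2: attn_dim weights (second layer, yields one scalar score per atom)
    """
    scores = []
    for vec in atom_vectors:
        hidden = [math.tanh(sum(w * x for w, x in zip(row, vec))) for row in w1]
        scores.append(sum(w * h for w, h in zip(w2, hidden)))
    # Softmax over atoms, shifted by the max score for numerical stability.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(atom_vectors[0])
    pooled = [sum(wt * vec[d] for wt, vec in zip(weights, atom_vectors))
              for d in range(dim)]
    return pooled, weights
```

When the second-layer weights are all zero, every atom receives the same score, so the result reduces to a plain average over atoms.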

Use python code/train_aada.py -h to check the meaning and default values of other parameters.

2. Running TAc-fc variants and ablations

  • To run TAc-fc,
python code/train_aada.py --source_data_path <source_assay_csv_file> --target_data_path <target_assay_csv_file> --dataset_type classification --extra_metrics prc-auc precision recall accuracy f1_score --hidden_size 25 --depth 4 --init_lr 1e-3 --batch_size 10 --ffn_hidden_size 100 --ffn_num_layers 2 --local_discriminator_hidden_size 100 --local_discriminator_num_layers 2 --global_discriminator_hidden_size 100 --global_discriminator_num_layers 2 --epochs 40 --alpha 1 --lamda 1 --split_type index_predetermined --crossval_index_file <index_file> --save_dir <chkpt_dir> --class_balance --mpn_shared
  • To run TAc-fc-dmpna, add these arguments to the above command
--attn_dim 100 --aggregation self-attention --model aada_attention
Ablations
  • To run TAc-f, add --exclude_global to the above command.
  • To run TAc-c, add --exclude_local to the above command.
  • Adding both --exclude_local and --exclude_global is equivalent to running TAc.
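When scripting many runs, the variants and ablations above can be kept in a small lookup table. The flag values below are copied from the commands in this README; the helper extra_flags and the option of combining the attention flags with an ablation are illustrative assumptions rather than documented behavior.

```python
# Flags copied from the commands in this README; the helper itself is illustrative.
ATTENTION_FLAGS = ["--attn_dim", "100", "--aggregation", "self-attention",
                   "--model", "aada_attention"]

ABLATION_FLAGS = {
    "TAc-fc": [],
    "TAc-f": ["--exclude_global"],
    "TAc-c": ["--exclude_local"],
    "TAc": ["--exclude_local", "--exclude_global"],
}


def extra_flags(variant, attention=False):
    """Return the flags to append to the TAc-fc base command for a variant."""
    if variant not in ABLATION_FLAGS:
        raise ValueError(f"unknown variant: {variant!r}")
    flags = list(ABLATION_FLAGS[variant])
    if attention:
        # Assumption: attention flags may be combined with any variant.
        flags += ATTENTION_FLAGS
    return flags
```

The returned list can be appended to the argument list of the TAc-fc command, for example when launching runs with subprocess.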

3. Running Baselines

DANN

python code/train_aada.py --source_data_path <source_assay_csv_file> --target_data_path <target_assay_csv_file> --dataset_type classification --extra_metrics prc-auc precision recall accuracy f1_score --hidden_size 25 --depth 4 --init_lr 1e-3 --batch_size 10 --ffn_hidden_size 100 --ffn_num_layers 2 --global_discriminator_hidden_size 100 --global_discriminator_num_layers 2 --epochs 40 --alpha 1 --lamda 1 --split_type index_predetermined --crossval_index_file <index_file> --save_dir <chkpt_dir> --class_balance --mpn_shared
  • To run DANN-dmpn, add --model dann to the above command.
  • To run DANN-dmpna, add --model dann_attention --attn_dim 100 --aggregation self-attention to the above command.

Run the following baselines from chemprop:

FCN-morgan

python chemprop/train.py --data_path <assay_csv_file> --dataset_type classification --extra_metrics prc-auc precision recall accuracy f1_score --init_lr 1e-3 --batch_size 10 --ffn_hidden_size 100 --ffn_num_layers 2 --epochs 40 --features_generator morgan --features_only --split_type index_predetermined --crossval_index_file <index_file> --save_dir <chkpt_dir> --class_balance

FCN-morganc

python chemprop/train.py --data_path <assay_csv_file> --dataset_type classification --extra_metrics prc-auc precision recall accuracy f1_score --init_lr 1e-3 --batch_size 10 --ffn_hidden_size 100 --ffn_num_layers 2 --epochs 40 --features_generator morgan_count --features_only --split_type index_predetermined --crossval_index_file <index_file> --save_dir <chkpt_dir> --class_balance

FCN-dmpn

python chemprop/train.py --data_path <assay_csv_file> --dataset_type classification --extra_metrics prc-auc precision recall accuracy f1_score --hidden_size 25 --depth 4 --init_lr 1e-3 --batch_size 10 --ffn_hidden_size 100 --ffn_num_layers 2 --epochs 40 --split_type index_predetermined --crossval_index_file <index_file> --save_dir <chkpt_dir> --class_balance

FCN-dmpna

Add the following to the above command:

--model mpnn_attention --attn_dim 100 --aggregation self-attention

For the above baselines, data_path specifies the path to the target assay CSV file.

FCN-dmpn(DT)

python chemprop/train.py --data_path <source_assay_csv_file> --target_data_path <target_assay_csv_file> --dataset_type classification --extra_metrics prc-auc precision recall accuracy f1_score  --hidden_size 25 --depth 4 --init_lr 1e-3 --batch_size 10 --ffn_hidden_size 100 --ffn_num_layers 2 --epochs 40 --split_type index_predetermined --crossval_index_file <index_file> --save_dir <chkpt_dir> --class_balance

FCN-dmpna(DT)

Add the following to the above command:

--model mpnn_attention --attn_dim 100 --aggregation self-attention

For FCN-dmpn(DT) and FCN-dmpna(DT), data_path and target_data_path specify the paths to the source and target assay CSV files, respectively.

Use python chemprop/train.py -h to check the meaning of other parameters.

Testing

  1. To predict the labels of the compounds in the test set for the TAc* and DANN methods:

    python code/predict.py --test_path <test_csv_file> --checkpoint_dir <chkpt_dir> --preds_path <pred_file>

    test_path specifies the path to a CSV file containing a list of SMILES and ground-truth labels. The first line is the header smiles,target. Each following line is comma-separated, with the SMILES string in the first column and the 0/1 label in the second.

    checkpoint_dir specifies the path to the directory where the model checkpoint (.pt) files are saved (i.e., save_dir during training).

    preds_path specifies the path to a CSV file where the predictions will be saved.

  2. To predict the labels of the compounds in the test set for other methods:

    python chemprop/predict.py --test_path <test_csv_file> --checkpoint_dir <chkpt_dir> --preds_path <pred_file>
    

Compound Prioritization using dmpna:

Please refer to the README.md in the comprank directory.

Owner
NingLab