Uni-Fold: Training your own deep protein-folding models.

This package provides an implementation of a trainable, Transformer-based deep protein folding model. We modified the open-source code of DeepMind AlphaFold v2.0 and added code to train the model from scratch. See the reference and the repository of DeepMind AlphaFold v2.0. To train your own Uni-Fold models, follow the steps below:

1. Install the environment.

Run the following code to install the dependencies of Uni-Fold:

  conda create -n unifold python=3.8.10 -y
  conda activate unifold
  ./install_dependencies.sh

Uni-Fold has been tested with Python 3.8.10, CUDA 11.1 and OpenMPI 4.1.1. We recommend Conda >= 4.10 when installing the environment: older Conda versions may lead to package conflicts.

2. Prepare data before training.

Before you start training your own folding models, you need to prepare the features and labels of the training proteins. The features of a protein mainly include its amino acid sequence, MSAs and templates. This information should be stored in a pickle file [pdb_id]_[model_id]_[chain_id]/features.pkl for each training protein. Uni-Fold provides scripts to process input FASTA files, relying on several external databases and tools. Labels are CIF files containing the structures of the proteins.
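
As a quick sanity check on a prepared feature file, the sketch below loads a features.pkl and lists what it contains. It assumes the pickle holds a dictionary (as in the AlphaFold data pipeline) and deliberately does not assume any particular key names; summarize_features and the example path are hypothetical, not part of Uni-Fold. Unpickling array-valued features requires the same libraries (such as NumPy) that were used to write them, and pickle files should only be loaded from trusted sources.

```python
import pickle
from pathlib import Path


def summarize_features(pkl_path):
    """Return {key: short description} for a pickled feature dictionary."""
    # Only unpickle files you trust: pickle can execute arbitrary code.
    with open(pkl_path, "rb") as handle:
        features = pickle.load(handle)
    if not isinstance(features, dict):
        raise TypeError(f"expected a dict, got {type(features).__name__}")
    summary = {}
    for key, value in features.items():
        shape = getattr(value, "shape", None)
        if shape is not None:
            # Array-like values (e.g. NumPy arrays) expose a shape.
            summary[key] = f"{type(value).__name__} shape={tuple(shape)}"
        elif hasattr(value, "__len__"):
            summary[key] = f"{type(value).__name__} len={len(value)}"
        else:
            summary[key] = type(value).__name__
    return summary


if __name__ == "__main__":
    # Hypothetical location; point this at one of your own feature files.
    path = Path("./out/1ak0_1_A/features.pkl")
    if path.exists():
        for key, description in sorted(summarize_features(path).items()):
            print(f"{key}: {description}")
```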

2.1 Datasets and external tools.

Uni-Fold adopts the same data processing pipeline as AlphaFold2. We kept the scripts from the AlphaFold2 repo that download the databases used to search for sequence homologs and templates. Use the command

  bash scripts/download_all_data.sh /path/to/database/directory

to download all required databases of Uni-Fold.

If you successfully installed the Conda environment in Section 1, the external tools for searching homologous sequences and templates should already be installed. Alternatively, you can customize the parameters of the feature preparation script to point to your own databases and tools.

2.2 Run the preparation code.

An example command for running the feature preparation pipeline is

  python generate_pkl_features.py \
    --fasta_dir ./example_data/fasta \
    --output_dir ./out \
    --data_dir /path/to/database/directory \
    --num_workers 1

This command automatically processes all FASTA files under fasta_dir and dumps the results to output_dir. Note that each FASTA file should contain only one sequence. By default, hhblits and jackhmmer use 4 and 8 CPUs, respectively. You can modify these numbers in unifold/data/tools/hhblits.py and unifold/data/tools/jackhmmer.py.
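
Because each FASTA file must hold exactly one sequence, a multi-sequence file has to be split before it is placed under fasta_dir. The following is a minimal sketch of such a split using only the standard library; split_fasta is a hypothetical helper, not a Uni-Fold script, and naming each output file after the first token of the record header is an assumption you may want to change.

```python
from pathlib import Path


def split_fasta(multi_fasta_path, output_dir):
    """Write each record of a multi-sequence FASTA file to its own file."""
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)

    records = []  # list of (header, [sequence lines])
    with open(multi_fasta_path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                records.append((line[1:], []))
            elif records:
                records[-1][1].append(line)

    written = []
    for header, seq_lines in records:
        # File name: first whitespace-delimited token of the header.
        # Records that share this token would overwrite each other.
        tokens = header.split()
        name = tokens[0] if tokens else f"seq{len(written)}"
        safe = "".join(c if c.isalnum() or c in "._-" else "_" for c in name)
        path = output_dir / f"{safe}.fasta"
        path.write_text(f">{header}\n{''.join(seq_lines)}\n")
        written.append(path)
    return written
```

Calling split_fasta("all_chains.fasta", "./example_data/fasta") would then leave one single-sequence file per record in the directory passed as --fasta_dir (both paths here are placeholders).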

2.3 Organize your training data.

Uni-Fold uses the class DataSystem to automatically sample and load the training entries. For this to work, pay attention to how the training data is organized. Two directories are required: one with input features (features.pkl files, referred to as features_dir) and one with labels (*.cif files, referred to as mmcif_dir). Files in the feature directory should be named [pdb_id]_[model_id]_[chain_id]/features.pkl, and files in the label directory should be named [pdb_id].cif. Make sure that every protein used for training has a corresponding label. See ./example_data/features and ./example_data/mmcif for examples of features_dir and mmcif_dir.
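
Before training, it can help to verify that every feature entry has a label. The sketch below is one way to do that, assuming feature directories are named [pdb_id]_[model_id]_[chain_id] and label files are named [pdb_id].cif; find_missing_labels is a hypothetical helper and is not part of DataSystem.

```python
from pathlib import Path


def find_missing_labels(features_dir, mmcif_dir):
    """Return names of feature directories that have no matching .cif label."""
    features_dir, mmcif_dir = Path(features_dir), Path(mmcif_dir)
    missing = []
    for entry in sorted(features_dir.iterdir()):
        # Skip anything that is not a directory holding a features.pkl file.
        if not (entry / "features.pkl").is_file():
            continue
        # Assumption: the PDB id is the part of the name before the first "_".
        pdb_id = entry.name.split("_")[0]
        if not (mmcif_dir / f"{pdb_id}.cif").is_file():
            missing.append(entry.name)
    return missing


if __name__ == "__main__":
    features = Path("./example_data/features")
    labels = Path("./example_data/mmcif")
    if features.is_dir() and labels.is_dir():
        print("entries without labels:", find_missing_labels(features, labels))
```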

3. Train Uni-Fold.

3.1 Configuration.

Before starting any actual training, make sure that the code is configured correctly. Modify the training configurations in unifold/train/train_config.py. We annotated the default configurations that reproduce AlphaFold in the script. Specifically, modify the data setups in unifold/train/train_config.py:

"data": {
  "train": {
    "features_dir": "where/training/protein/features/are/stored/",
    "mmcif_dir": "where/training/mmcif/files/are/stored/",
    "sample_weights": "which/specifies/proteins/for/training.json"
  },
  "eval": {
    "features_dir": "where/validation/protein/features/are/stored/",
    "mmcif_dir": "where/validation/mmcif/files/are/stored/",
    "sample_weights": "which/specifies/proteins/for/training.json"
  }
}

The specified data should be contained in two folders, namely a features_dir and an mmcif_dir. The organization of the two directories is described in Section 2.3. If you want to train on a subset of the data under these directories, or to assign a custom sample weight to each protein, write a JSON file and pass its path to sample_weights. This is optional: if you leave it as None, the program will attempt to use all entries under features_dir with uniform weights. The JSON file should be a dictionary mapping the basename of each protein feature directory ([pdb_id]_[model_id]_[chain_id]) to the sample weight of that protein in the training process (integer or float), such as:

{"1am9_1_C": 82, "1amp_1_A": 291, "1aoj_1_A": 60, "1aoz_1_A": 552}

For uniform sampling, a plain list of protein entries suffices:

["1am9_1_C", "1amp_1_A", "1aoj_1_A", "1aoz_1_A"]
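
To make the three accepted forms of sample_weights concrete (None, a dictionary, or a list), the sketch below turns each of them into normalized sampling probabilities. How DataSystem consumes the weights internally is not shown in this document, so the normalization here is an illustrative assumption; list_entries and load_sample_probabilities are hypothetical names.

```python
import json
from pathlib import Path


def list_entries(features_dir):
    """Names of all sub-directories of features_dir that hold a features.pkl."""
    return sorted(p.parent.name for p in Path(features_dir).glob("*/features.pkl"))


def load_sample_probabilities(sample_weights, features_dir):
    """Map each protein entry to a sampling probability.

    sample_weights is None (use every entry uniformly) or the path to a JSON
    file holding either {entry: weight} or a plain list of entries.
    """
    if sample_weights is None:
        weights = {name: 1.0 for name in list_entries(features_dir)}
    else:
        with open(sample_weights) as handle:
            loaded = json.load(handle)
        if isinstance(loaded, dict):
            weights = {name: float(w) for name, w in loaded.items()}
        else:
            weights = {name: 1.0 for name in loaded}
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("sample weights must sum to a positive value")
    return {name: w / total for name, w in weights.items()}
```

With the dictionary example above, 1aoz_1_A (weight 552 out of a total of 985) would be drawn a little more than half of the time under this interpretation.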

The model configurations can be edited in unifold/model/config.py by users who want to customize their own folding models.

3.2 Run the training code!

To train the model on a single node without MPI, run

python train.py

You can also train the model using MPI (or workload managers that support MPI, such as PBS or Slurm) by running:

mpirun -n <number_of_gpus> python train.py

Either way, make sure you properly configure the option use_mpi in unifold/train/train_config.py.
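
The two launch modes can be wrapped in a small helper so that job scripts build the right command from one setting. This is only a sketch: build_train_command is a hypothetical function, and it does not edit use_mpi in unifold/train/train_config.py, which still has to match the launch mode you choose.

```python
import shlex


def build_train_command(num_gpus=1, use_mpi=False):
    """Return the training command as an argument list."""
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")
    base = ["python", "train.py"]
    if use_mpi:
        # One MPI process per GPU, as in the mpirun example above.
        return ["mpirun", "-n", str(num_gpus)] + base
    return base


if __name__ == "__main__":
    # Print the command instead of running it; pass the list to
    # subprocess.run(..., check=True) to actually launch training.
    print(shlex.join(build_train_command(num_gpus=4, use_mpi=True)))
```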

4. Inference with trained models.

4.1 Inference from features.pkl.

We provide the run_from_pkl.py script to infer protein structures from features.pkl inputs. A demo command is

python run_from_pkl.py \
  --pickle_dir ./example_data/features \
  --model_names model_2 \
  --model_paths /path/to/model_2.npz \
  --output_dir ./out

or

python run_from_pkl.py \
  --pickle_paths ./example_data/features/1ak0_1_A/features.pkl \
  --model_names model_2 \
  --model_paths /path/to/model_2.npz \
  --output_dir ./out

For each input model, the command generates the predicted structures of the input features (in PDB format), the running time of each component, and the corresponding residue-wise confidence scores (predicted LDDT, or pLDDT).

4.2 Inference from FASTA files.

Essentially, inferring structures from FASTA files involves two steps: generating the pickled features and predicting structures from them. We provide a script, run_from_fasta.py, as a friendlier user interface. An example usage is

python run_from_fasta.py \
  --fasta_paths ./example_data/fasta/1ak0_1_A.fasta \
  --model_names model_2 \
  --model_paths /path/to/model_2.npz \
  --data_dir /path/to/database/directory \
  --output_dir ./out
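
For reference, the two steps that run_from_fasta.py combines can also be chained by hand. The sketch below builds both commands from the flags shown in Sections 2.2 and 4.1 and runs them only when dry_run is disabled. It assumes that the output directory of generate_pkl_features.py can be passed directly as --pickle_dir; build_pipeline and run_pipeline are hypothetical helpers.

```python
import subprocess


def build_pipeline(fasta_dir, feature_dir, data_dir, model_name, model_path, output_dir):
    """Return the feature-generation and prediction commands, in order."""
    generate = [
        "python", "generate_pkl_features.py",
        "--fasta_dir", fasta_dir,
        "--output_dir", feature_dir,
        "--data_dir", data_dir,
        "--num_workers", "1",
    ]
    predict = [
        "python", "run_from_pkl.py",
        "--pickle_dir", feature_dir,  # assumed to match the layout written above
        "--model_names", model_name,
        "--model_paths", model_path,
        "--output_dir", output_dir,
    ]
    return [generate, predict]


def run_pipeline(commands, dry_run=True):
    """Print each command and, unless dry_run is set, execute it."""
    for command in commands:
        print(" ".join(command))
        if not dry_run:
            # check=True stops the pipeline if a step fails.
            subprocess.run(command, check=True)


if __name__ == "__main__":
    steps = build_pipeline(
        "./example_data/fasta", "./features", "/path/to/database/directory",
        "model_2", "/path/to/model_2.npz", "./out",
    )
    run_pipeline(steps, dry_run=True)
```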

4.3 Generate MSA with MMseqs2.

Generating MSAs can take hours and a lot of memory, especially for long sequences. In that case, MMseqs2 may be a more efficient alternative. Once installed, it can be used as follows:

# download and build database
mkdir mmseqs_db && cd mmseqs_db
wget http://wwwuser.gwdg.de/~compbiol/colabfold/uniref30_2103.tar.gz
wget http://wwwuser.gwdg.de/~compbiol/colabfold/colabfold_envdb_202108.tar.gz
tar xzvf uniref30_2103.tar.gz
tar xzvf colabfold_envdb_202108.tar.gz
mmseqs tsv2exprofiledb uniref30_2103 uniref30_2103_db
mmseqs tsv2exprofiledb colabfold_envdb_202108 colabfold_envdb_202108_db
mmseqs createindex uniref30_2103_db tmp
mmseqs createindex colabfold_envdb_202108_db tmp
cd ..

# MSA search
./scripts/colabfold_search.sh mmseqs "query.fasta" "mmseqs_db/" "result/" "uniref30_2103_db" "" "colabfold_envdb_202108_db" "1" "0" "1"

5. Changes from AlphaFold to Uni-Fold.

  • We implemented classes and methods for training and inference pipelines by adding scripts under unifold/train and unifold/inference.
  • We added scripts for installing the environment, training and inference.
  • Files under unifold/common, unifold/data and unifold/relax are minimally altered for re-structuring the repository.
  • Files under unifold/model are moderately altered to allow mixed-precision training.
  • We removed scripts that are not used in training the AlphaFold model.

6. License and disclaimer.

6.1 Uni-Fold code license.

Copyright 2021 Beijing DP Technology Co., Ltd.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0.

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

6.2 Use of third-party software.

Use of the third-party software, libraries or code may be governed by separate terms and conditions or license provisions. Your use of the third-party software, libraries or code is subject to any such terms and you should check that you can comply with any applicable restrictions or terms and conditions before use.

6.3 Contributing to Uni-Fold.

Uni-Fold is an ongoing project. Our target is to design better protein folding models and to apply them in real scenarios. We welcome the community to join us in developing the repository, including but not limited to 1) bug reports and fixes, 2) new features and 3) better interfaces. Please refer to CONTRIBUTING.md for more information.

Owner
DeepModeling