CHERRY is a python library for predicting the interactions between viral and prokaryotic genomes

Related tags

Deep LearningCHERRY
Overview

CHERRY CHERRY is a python library for predicting the interactions between viral and prokaryotic genomes. CHERRY is based on a deep learning model, which consists of a graph convolutional encoder and a link prediction decoder.

Overview

There are two kind of tasks that CHERRY can work:

  1. Host prediction for virus
  2. Identifying viruses that infect pathogenic bacteria

Users can choose one of the task when running CHERRY. If you have any trouble installing or using CHERRY, please let us know by opening an issue on GitHub or emailing us ([email protected]).

Required Dependencies

  • Python 3.x
  • Numpy
  • Pytorch>1.8.0
  • Networkx
  • Pandas
  • Diamond
  • BLAST
  • MCL
  • Prodigal

All these packages can be installed using Anaconda.

If you want to use the gpu to accelerate the program:

  • cuda
  • Pytorch-gpu

An easiler way to install

We recommend you to install all the package with Anaconda

After cloning this respository, you can use anaconda to install the CHERRY.yaml. This will install all packages you need with gpu mode (make sure you have installed cuda on your system to use the gpu version. Othervise, it will run with cpu version). The command is: conda env create -f CHERRY.yaml

  • For cpu version pytorch: conda install pytorch torchvision torchaudio cpuonly -c pytorch
  • For gpu version pytorch: Search pytorch to find the correct cuda version according to your computer Note: we suggest you to install all the package using conda (both miniconda and anaconda are ok). We supply a

Prepare the database

Due to the limited size of the GitHub, we zip the database. Before using CHEERY, you need to unpack them using the following commands.

cd CHEERY/dataset
bzip2 -d protein.fasta.bz2
bzip2 -d nucl.fasta.bz2
cd ../prokaryote
gunzip *
cd ..

Usage

1 Predicting host for viruses

If you want to predict hosts for viruses, the input should be a fasta file containing the virual sequences. We support an example file named "test_contigs.fa" in the Github folder. Then, the only command that you need to run is

python run_Speed_up.py [--contigs INPUT_FA] [--len MINIMUM_LEN] [--model MODEL] [--topk TOPK_PRED]

Options

  --contigs INPUT_FA
                        input fasta file
  --len MINIMUM_LEN
                        predict only for sequence >= len bp (default 8000)
  --model MODEL (pretrain or retrain)
                        predicting host with pretrained parameters or retrained paramters (default pretrain)
  --topk TOPK_PRED
                        The host prediction with topk score (default 1)

Example

Prediction on species level with pretrained paramters:

python run_Speed_up.py --contigs test_contigs.fa --len 8000 --model pretrain --topk 3

Note: Commonly, you do not need to retrain the model, especially when you do not have gpu unit.

OUTPUT

The format of the output file is a csv file ("final_prediction.csv") which contain the prediction of each virus. Column contig_name is the accession from the input.

Since the topk method is given, we cannot give the how taxaonmic tree for each prediction. However, we will supply a script for you to convert the prediction into a complte taxonmoy tree. Use the following command to generate taxonomy tree:

python run_Taxonomy_tree.py [--k TOPK_PRED]

Because there are k prediction in the "final_prediction.csv" file, you need to specify the k to generate the tree. The output of program is 'Top_k_prediction_taxonomy.csv'.

2 Predicting virus infecting prokaryote

If you want to predict hosts for viruses, you need to supply two kinds of inputs:

  1. Place your prokaryotic genomes in new_prokaryote/ folder.
  2. A fasta file containing the virus squences. Then, the program will output which virus in your fasta file will infect the prkaryotes in the new_prokaryote/ folder.

The command is simlar to the previous one but two more paramter is need:

python run_Speed_up.py [--mode MODE] [--t THRESHOLD]

Example

python run_Speed_up.py --contigs test_contigs.fa --mode prokaryote --t 0.98

Options

  --mode MODE (prokaryote or virus)
                        Switch mode for predicting virus or predicting host
  --t THRESHOLD
                        The confident threshold for predicting virus, the higier the threshold the higher the precision. (default 0.98)

OUTPUT

The format of the output file is a csv file which contain the prediction of each virus. Column prokaryote is the accession of your given prokaryotic genomes. Column virus is the list of viruses that might infect these genomes.

Extension of the parokaryotic genomes database

Due to the limitation of storage on GitHub, we only provided the parokaryote with known interactions (Date up to 2020) in prokaryote folder. If you want to predict interactions with more species, please place your parokaryotic genomes into prokaryote/ folder and add an entry of taxonomy information into dataset/prokaryote.csv. We also recommand you only add the prokaryotes of interest to save the computation resourse and time. This is because all the genomes in prokaryote folder will be used to generate the multimodal graph, which is a O(n^2) algorithm.

Example

If you have a metagenomic data and you know that only E. coli, Butyrivibrio fibrisolvens, and Faecalibacterium prausnitzii exist in the metagenomic data. Then you can placed the genomes of these three species into the prokaryote/ and add the entry in dataset/prokaryote.csv. An example of the entry is look like:

GCF_000007445,Bacteria,Proteobacteria,Gammaproteobacteria,Enterobacterales,Enterobacteriaceae,Escherichia,Escherichia coli

The corresponding header of the entry is: Accession,Superkingdom,Phylum,Class,Order,Family,Genus,Species. If you do not know the whole taxonomy tree, you can directly use a specific name for all columns. Because CHERRY is a link prediction tool, it will directly use the given name for prediction.

Noted: Since our program will use the accession for searching and constructing the knowledge graph, the name of the fasta file of your genomes should be the same as the given accession. For example, if your accession is GCF_000007445, your file name should be GCF_000007445.fa. Otherwise, the program cannot find the entry.

Extension of the virus-prokaryote interactions database

If you know more virus-prokaryote interactions than our pre-trained model (given in Interactiondata), you can add them to train a custom model. Several steps you need to do to train your model:

  1. Add your viral genomes into the nucl.fasta file and run the python refresh.py to generate new protein.fasta and database_gene_to_genome.csv files. They will replace the old one in the dataset/ folder automatically.
  2. Add the entrys of host taxonomy information into dataset/virus.csv. The corresponding header of the entry is: Accession (of the virus), Superkingdom, Phylum, Class, Order, Family, Genus, Species. The required field is Species. You can left it blank if you do not know other fields. Also, the accession of the virus shall be the same as your fasta entry.
  3. Place your prokaryotic genomes into the the prokaryote/ folder and add an entry in dataset/prokaryote.csv. The guideline is the same as the previous section.
  4. Use retrain as the parameter for --mode option to run the program.

References

The paper is submitted to the Briefings in Bioinformatics.

The arXiv version can be found via: CHERRY: a Computational metHod for accuratE pRediction of virus-pRokarYotic interactions using a graph encoder-decoder model

Contact

If you have any questions, please email us: [email protected]

Notes

  1. if the program output an error (which is caused by your machine): Error: mkl-service + Intel(R) MKL: MKL_THREADING_LAYER=INTEL is incompatible with libgomp.so.1 library. You can type in the command export MKL_SERVICE_FORCE_INTEL=1 before runing run_Speed_up.py
Owner
Kenneth Shang
Kenneth Shang
Tensorflow implementation of Swin Transformer model.

Swin Transformer (Tensorflow) Tensorflow reimplementation of Swin Transformer model. Based on Official Pytorch implementation. Requirements tensorflow

167 Jan 08, 2023
Fine-grained Post-training for Improving Retrieval-based Dialogue Systems - NAACL 2021

Fine-grained Post-training for Multi-turn Response Selection Implements the model described in the following paper Fine-grained Post-training for Impr

Janghoon Han 83 Dec 20, 2022
This repo contains implementation of different architectures for emotion recognition in conversations.

Emotion Recognition in Conversations Updates 🔥 🔥 🔥 Date Announcements 03/08/2021 🎆 🎆 We have released a new dataset M2H2: A Multimodal Multiparty

Deep Cognition and Language Research (DeCLaRe) Lab 1k Dec 30, 2022
Style-based Neural Drum Synthesis with GAN inversion

Style-based Drum Synthesis with GAN Inversion Demo TensorFlow implementation of a style-based version of the adversarial drum synth (ADS) from the pap

Sound and Music Analysis (SoMA) Group 29 Nov 19, 2022
A collection of Google research projects related to Federated Learning and Federated Analytics.

Federated Research Federated Research is a collection of research projects related to Federated Learning and Federated Analytics. Federated learning i

Google Research 483 Jan 05, 2023
Preprossing-loan-data-with-NumPy - In this project, I have cleaned and pre-processed the loan data that belongs to an affiliate bank based in the United States.

Preprossing-loan-data-with-NumPy In this project, I have cleaned and pre-processed the loan data that belongs to an affiliate bank based in the United

Dhawal Chitnavis 2 Jan 03, 2022
A Python reference implementation of the CF data model

cfdm A Python reference implementation of the CF data model. References Compliance with FAIR principles Documentation https://ncas-cms.github.io/cfdm

NCAS CMS 25 Dec 13, 2022
Semi-supervised Implicit Scene Completion from Sparse LiDAR

Semi-supervised Implicit Scene Completion from Sparse LiDAR Paper Created by Pengfei Li, Yongliang Shi, Tianyu Liu, Hao Zhao, Guyue Zhou and YA-QIN ZH

114 Nov 30, 2022
For encoding a text longer than 512 tokens, for example 800. Set max_pos to 800 during both preprocessing and training.

LongScientificFormer For encoding a text longer than 512 tokens, for example 800. Set max_pos to 800 during both preprocessing and training. Some code

Athar Sefid 6 Nov 02, 2022
Implementation EfficientDet: Scalable and Efficient Object Detection in PyTorch

Implementation EfficientDet: Scalable and Efficient Object Detection in PyTorch

tonne 1.4k Dec 29, 2022
Graph parsing approach to structured sentiment analysis.

Fine-grained Sentiment Analysis as Dependency Graph Parsing This repository contains the code and datasets described in following paper: Fine-grained

Jeremy Barnes 36 Dec 12, 2022
This code is 3d-CNN model that can predict environmental value

Predict-environmental-value-3dCNN This code is 3d-CNN model that can predict environmental value. Firstly, I built a model that can create a lot of bu

1 Jan 06, 2022
Third party Pytorch implement of Image Processing Transformer (Pre-Trained Image Processing Transformer arXiv:2012.00364v2)

ImageProcessingTransformer Third party Pytorch implement of Image Processing Transformer (Pre-Trained Image Processing Transformer arXiv:2012.00364v2)

61 Jan 01, 2023
PaddleRobotics is an open-source algorithm library for robots based on Paddle, including open-source parts such as human-robot interaction, complex motion control, environment perception, SLAM positioning, and navigation.

简体中文 | English PaddleRobotics paddleRobotics是基于paddle的机器人开源算法库集,包括人机交互、复杂运动控制、环境感知、slam定位导航等开源算法部分。 人机交互 主动多模交互技术TFVT-HRI 主动多模交互技术是通过视觉、语音、触摸传感器等输入机器人

185 Dec 26, 2022
nnFormer: Interleaved Transformer for Volumetric Segmentation Code for paper "nnFormer: Interleaved Transformer for Volumetric Segmentation "

nnFormer: Interleaved Transformer for Volumetric Segmentation Code for paper "nnFormer: Interleaved Transformer for Volumetric Segmentation ". Please

jsguo 610 Dec 28, 2022
Second Order Optimization and Curvature Estimation with K-FAC in JAX.

KFAC-JAX - Second Order Optimization with Approximate Curvature in JAX Installation | Quickstart | Documentation | Examples | Citing KFAC-JAX KFAC-JAX

DeepMind 90 Dec 22, 2022
The MATH Dataset

Measuring Mathematical Problem Solving With the MATH Dataset This is the repository for Measuring Mathematical Problem Solving With the MATH Dataset b

Dan Hendrycks 267 Dec 26, 2022
PyTorch implementation of DD3D: Is Pseudo-Lidar needed for Monocular 3D Object detection?

PyTorch implementation of DD3D: Is Pseudo-Lidar needed for Monocular 3D Object detection? (ICCV 2021), Dennis Park*, Rares Ambrus*, Vitor Guizilini, Jie Li, and Adrien Gaidon.

Toyota Research Institute - Machine Learning 364 Dec 27, 2022
ArtEmis: Affective Language for Art

ArtEmis: Affective Language for Art Created by Panos Achlioptas, Maks Ovsjanikov, Kilichbek Haydarov, Mohamed Elhoseiny, Leonidas J. Guibas Introducti

Panos 268 Dec 12, 2022
Semi-supervised learning for object detection

Source code for STAC: A Simple Semi-Supervised Learning Framework for Object Detection STAC is a simple yet effective SSL framework for visual object

Google Research 348 Dec 25, 2022