Contrastive Learning for Metagenomic Binning

CLMB

A simple framework for CLMB - a novel deep Contrastive Learning for Metagenomic Binning

Created by Pengfei Zhang, a senior in the Department of Computer Science, University of Science and Technology of China.

We developed CLMB within the framework of VAMB, which was published in Nature Biotechnology (https://doi.org/10.1038/s41587-020-00777-4). All commands are the same as in VAMB. To implement our algorithm, we added the files simclr_module.py and augmentation.py and modified the files __main__.py and encode.py.

This is a simple implementation of CLMB, and the code is not yet polished. We will clean up the code, add interfaces, and write documentation later.

The basic idea of the CLMB module is that, since the noise in real datasets is hard to quantify, we add simulated noise to the data and force the training to be robust to it. By effectively tackling the noise in metagenomic data using the contrastive deep learning framework (https://arxiv.org/pdf/2002.05709.pdf), we can group contigs that originate from the same type of bacteria together while assigning contigs from different species to different bins.
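
The idea can be illustrated with a minimal sketch. It assumes Gaussian noise as the augmentation and the NT-Xent loss described in the contrastive learning paper linked above. The function names and the noise model are hypothetical and are not taken from simclr_module.py or augmentation.py.

```python
import math
import random


def add_noise(features, scale, rng):
    """Return a noisy copy of a feature vector (hypothetical augmentation)."""
    return [x + rng.gauss(0.0, scale) for x in features]


def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def nt_xent(views_a, views_b, temperature=0.5):
    """NT-Xent loss for a batch where views_a[i] and views_b[i] form a positive pair."""
    z = views_a + views_b
    n = len(views_a)
    total = 0.0
    for i in range(2 * n):
        pos = (i + n) % (2 * n)  # index of the other view of the same contig
        logits = [cosine(z[i], z[k]) / temperature for k in range(2 * n) if k != i]
        pos_logit = cosine(z[i], z[pos]) / temperature
        log_denominator = math.log(sum(math.exp(value) for value in logits))
        total += log_denominator - pos_logit
    return total / (2 * n)


rng = random.Random(0)
contigs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]  # toy feature vectors
view_a = [add_noise(c, 0.05, rng) for c in contigs]
view_b = [add_noise(c, 0.05, rng) for c in contigs]

aligned = nt_xent(view_a, view_b)                      # views paired correctly
shuffled = nt_xent(view_a, view_b[1:] + view_b[:1])    # views paired with the wrong contig
print(aligned < shuffled)
```

The loss is lower when the two noisy views of each contig are paired with each other than when they are paired with views of other contigs, which is the signal that training would exploit.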

Vamb

Created by Jakob Nybo Nissen and Simon Rasmussen, Technical University of Denmark and Novo Nordisk Foundation Center for Protein Research, University of Copenhagen.

Vamb is a metagenomic binner which feeds sequence composition information from a contig catalogue and co-abundance information from BAM files into a variational autoencoder and clusters the latent representation. It performs excellently with multiple samples, and pretty well on single-sample data. Vamb is implemented purely in Python (with a little bit of Cython) and can be used both from the command line and from within a Python interpreter.

For more information about the implementation, methodological considerations, and advanced usage of Vamb, see the tutorial file (doc/tutorial.html)

Installation:

To install the latest version from GitHub, clone the repository and install it using:

git clone https://github.com/RasmussenLab/vamb -b master
cd vamb
pip install -e .

Running

For a detailed explanation of the parameters of Vamb, or different inputs, see the tutorial in the doc directory.

Updated in 3.0.2: for a snakemake pipeline see workflow directory.

For more command-line options, see the command-line help menu:

vamb -h

Here's how to run Vamb

For this example, let us suppose you have a directory of short (e.g. Illumina) reads in a directory /path/to/reads, and that you have already quality controlled them.

  1. Run your favorite metagenomic assembler on each sample individually:
spades.py --meta /path/to/reads/sample1.fw.fq.gz /path/to/reads/sample1.rv.fq.gz \
-k 21,29,39,59,79,99 -t 24 -m 100gb -o /path/to/assemblies/sample1
  2. Use Vamb's concatenate.py to make the FASTA catalogue of all your assemblies:
concatenate.py /path/to/catalogue.fna.gz /path/to/assemblies/sample1/contigs.fasta \
/path/to/assemblies/sample2/contigs.fasta  [ ... ]
  3. Use your favorite short-read aligner to map each of your read files back to the resulting FASTA file:
minimap2 -d catalogue.mmi /path/to/catalogue.fna.gz; # make index
minimap2 -t 8 -N 50 -ax sr catalogue.mmi /path/to/reads/sample1.fw.fq.gz /path/to/reads/sample1.rv.fq.gz | samtools view -F 3584 -b --threads 8 > /path/to/bam/sample1.bam
  4. Run Vamb:
vamb --outdir path/to/outdir --fasta /path/to/catalogue.fna.gz --bamfiles /path/to/bam/*.bam -o C --minfasta 200000

Note that we have found that MetaBAT2's jgi_summarize_bam_contig_depths program estimates BAM depths more accurately than Vamb's parsebam module (see below). If you want to use this approach instead, we provide an easy-to-use snakemake workflow which will do this for you.

Snakemake workflow

To make it even easier to run Vamb in the best possible way, we have created a Snakemake workflow that will run steps 2-4 above using MetaBAT2's jgi_summarize_bam_contig_depths program for improved counting. Additionally, it will run CheckM to estimate completeness and contamination of the resulting bins. It can run on a local machine, a workstation, or an HPC system using qsub. It is included in the workflow folder.

Invoking Vamb

After installation with pip, Vamb will show up in your PATH variable, and you can simply run:

vamb

To run Vamb with a Python executable other than the default (say, python3.7), you can run:

python3.7 -m vamb

You can also run the inner vamb directory as a script. This will work even if you did not install with pip:

python my_scripts/vamb/vamb

Inputs and outputs

Inputs

Also see the section: Recommended workflow

Vamb relies on two properties of the DNA sequences to be binned:

  • The kmer-composition of the sequence (here tetranucleotide frequency, TNF) and
  • The abundance of the contigs in each sample (the depth or the RPKM).

So before you can run Vamb, you need to have files from which Vamb can calculate these values:

  • TNF is calculated from a regular FASTA file of DNA sequences.
  • Depth is calculated from BAM-files of a set of reads from each sample mapped to that same FASTA file.
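
To make the first of these properties concrete, here is a minimal sketch of a plain tetranucleotide frequency count. It is an illustration only: the function name is hypothetical, and Vamb's own implementation may differ, for instance in how it treats reverse complements or ambiguous bases.

```python
from itertools import product


def tetranucleotide_frequencies(sequence):
    """Return the relative frequency of each of the 256 DNA 4-mers in a sequence."""
    kmers = ["".join(p) for p in product("ACGT", repeat=4)]
    counts = dict.fromkeys(kmers, 0)
    sequence = sequence.upper()
    for i in range(len(sequence) - 3):
        kmer = sequence[i:i + 4]
        if kmer in counts:  # skip windows containing N or other ambiguous bases
            counts[kmer] += 1
    total = sum(counts.values())
    if total == 0:
        return {kmer: 0.0 for kmer in kmers}
    return {kmer: count / total for kmer, count in counts.items()}


tnf = tetranucleotide_frequencies("ACGTACGTAC")
print(tnf["ACGT"])  # 2 of the 7 windows are ACGT
```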

⚠️ Important: Vamb can use information from multi-mapping reads, but all alignments of a single read must be consecutive in the BAM files. See section 4 of Recommended workflow.

Remember that the quality of Vamb's bins is no better than the quality of the input files. If your BAM files are constructed carelessly, for example by allowing reads from distinct species to crossmap indiscriminately, your BAM files will not contain information with which Vamb can separate those species. In general, you want reads to map only to contigs within the same phylogenetic distance that you want Vamb to bin together.

Estimation of TNF and RPKM is subject to statistical uncertainty. Therefore, Vamb works less well on short sequences and on data with low depth. Vamb can work on shorter sequences such as genes, which are more easily homology reduced. However, we recommend not using homology reduction on the input sequences, and instead prevent duplicated strains by using binsplitting (see section: recommended workflow.)
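
The effect of sequence length on the abundance estimate can be seen from the standard RPKM definition (reads per kilobase per million mapped reads). The sketch below uses that textbook formula with a hypothetical function name. It does not reproduce how Vamb's parsebam module counts reads.

```python
def rpkm(reads_mapped_to_contig, contig_length_bp, total_mapped_reads):
    """Reads per kilobase of contig per million mapped reads."""
    if contig_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("contig length and total read count must be positive")
    kilobases = contig_length_bp / 1_000
    millions_of_reads = total_mapped_reads / 1_000_000
    return reads_mapped_to_contig / (kilobases * millions_of_reads)


# Two contigs with the same RPKM in a sample of 2 million mapped reads.
long_contig = rpkm(500, 50_000, 2_000_000)
short_contig = rpkm(20, 2_000, 2_000_000)
print(long_contig, short_contig)
```

Both contigs get the same value, but the estimate for the short contig rests on 20 reads instead of 500, so a handful of stray reads shifts it far more.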

Outputs

Vamb produces the following output files:

  • log.txt - a text file with information about the Vamb run. Look here (and at stderr) if you experience errors.
  • tnf.npz, lengths.npz, rpkm.npz, mask.npz and latent.npz - Numpy .npz files with the TNFs, the contig lengths, the RPKM, a mask of which sequences were successfully encoded, and the latent encoding of the sequences.
  • model.pt - containing a PyTorch model object of the trained VAE. You can load the VAE from this file using vamb.encode.VAE.load from Python.
  • clusters.tsv - a two-column text file with one row per sequence: the left column gives the cluster (i.e. bin) name, the right column the sequence name. You can create the FASTA-file bins themselves using the function vamb.vambtools.write_bins (see doc/tutorial.html for more details).
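
As an illustration of the clusters.tsv layout described above, the following sketch reads such a file into a dictionary. It assumes tab-separated columns with no header row. The function name and the example contig names are made up.

```python
import io
from collections import defaultdict


def read_clusters(handle):
    """Map each cluster name to the list of sequence names assigned to it."""
    clusters = defaultdict(list)
    for line in handle:
        line = line.rstrip("\n")
        if not line:  # tolerate blank lines
            continue
        cluster, sequence = line.split("\t")
        clusters[cluster].append(sequence)
    return dict(clusters)


# An in-memory stand-in for open("clusters.tsv").
example = io.StringIO("1\tS1C10\n1\tS1C42\n2\tS2C7\n")
clusters = read_clusters(example)
print(clusters)
```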

Recommended workflow

1) Preprocess the reads and check their quality

We use AdapterRemoval combined with FastQC for this - but you can use whichever tool you think gives the best results.

2) Assemble each sample individually and get the contigs out

We recommend using metaSPAdes on each sample individually. You can also use scaffolds or other nucleotide sequences instead of contigs as input sequences to Vamb. Assemble each sample individually, as single-sample assembly followed by samplewise binsplitting gives the best results.

3) Concatenate the FASTA files together while making sure all contig headers stay unique, and filter away small contigs

You can use the function vamb.vambtools.concatenate_fasta for this or the script src/concatenate.py.

⚠️ Important: Vamb uses a neural network to encode sequences, and neural networks overfit on small datasets. We have tested that Vamb's neural network does not overfit too badly on all datasets we have worked with, but we have not tested on any dataset with fewer than 50,000 contigs.

You should not try to bin very short sequences. Choosing the length cutoff for your input sequences is a tradeoff: a cutoff that is too low retains hard-to-bin contigs, which adversely affects the binning of all contigs, while a cutoff that is too high throws out good data. We use a length cutoff of 2000 bp as default but haven't actually run tests for the optimal value.
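
A length cutoff of this kind can be sketched in a few lines. The FASTA reader below is a simplified stand-in written for this example. It is not vamb.vambtools.concatenate_fasta, and the function names are hypothetical.

```python
import io


def read_fasta(handle):
    """Yield (header, sequence) pairs from a FASTA file handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif line:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def filter_by_length(records, min_length=2000):
    """Keep only records whose sequence has at least min_length bases."""
    return [(header, seq) for header, seq in records if len(seq) >= min_length]


# One 40 bp contig and one 2400 bp contig split over two lines.
fasta = io.StringIO(
    ">short\n" + "ACGT" * 10 + "\n"
    ">long\n" + "ACGT" * 300 + "\n" + "ACGT" * 300 + "\n"
)
kept = filter_by_length(read_fasta(fasta))
print([header for header, _ in kept])
```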

Your contig headers must be unique. Furthermore, if you want to use binsplitting (and you should!), your contig headers must be of the format {Samplename}{Separator}{X}, such that the part of the string before the first occurrence of {Separator} gives a name of the sample it originated from. For example, you could call contig number 115 from sample number 9 "S9C115", where "S9" would be {Samplename}, "C" is {Separator} and "115" is {X}.
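
The header convention can be illustrated with a short sketch that recovers the sample name and splits a cluster by sample. This is an assumption about what binsplitting amounts to conceptually, with hypothetical function names. It is not Vamb's own code.

```python
from collections import defaultdict


def sample_of(header, separator):
    """Return the sample name: everything before the first separator."""
    sample, found, _ = header.partition(separator)
    if not found:
        raise ValueError(f"separator {separator!r} not found in {header!r}")
    return sample


def binsplit(clusters, separator):
    """Split each cluster into one bin per sample of origin."""
    split = defaultdict(list)
    for cluster, headers in clusters.items():
        for header in headers:
            split[(sample_of(header, separator), cluster)].append(header)
    return dict(split)


bins = binsplit({"1": ["S9C115", "S9C20", "S3C7"]}, "C")
print(bins)
```

Because the split happens at the first occurrence of the separator, a sample name that itself contains the separator (for example "SC9" with separator "C") would be parsed incorrectly, so it is worth choosing a separator that cannot occur in sample names.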

Vamb is fairly memory efficient, and we have run Vamb with 1000 samples and 5.9 million contigs using <30 GB of RAM. If you have a dataset too large to fit in RAM and feel the temptation to bin each sample individually, you can instead use a tool like MASH to group similar samples together in smaller batches and bin these batches individually. This way, you can still leverage co-abundance. NB: We have a version using memory-mapping that is much more RAM-efficient but 10-20% slower. With it, we have processed a dataset of 942 samples with 30M contigs (a total of 117 Gbp of contig sequence) in 40 GB of RAM - see branch mmap.

4) Map the reads to the FASTA file to obtain BAM files

⚠️ Important: If you allow reads to map to multiple contigs, the abundance estimation will be more accurate. However, all BAM records for a single read must be consecutive in the BAM file, or else Vamb will miscount these alignments. This is the default order in the output of almost all aligners, but if you use BAM files sorted by alignment position and have multi-mapping reads, you must sort them by read name first.
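
The ordering requirement can be checked with a simple pass over the read names in file order. The sketch below works on a plain list of names. Extracting the names from a BAM file (for example with a library such as pysam) is left out, and the function name is hypothetical.

```python
def reads_are_consecutive(read_names):
    """Return True if every read name forms one unbroken run in the list."""
    finished = set()  # names whose run has already ended
    previous = None
    for name in read_names:
        if name != previous:
            if name in finished:
                return False  # this read already appeared in an earlier run
            if previous is not None:
                finished.add(previous)
            previous = name
    return True


grouped = reads_are_consecutive(["r1", "r1", "r2", "r2", "r3"])
position_sorted = reads_are_consecutive(["r1", "r2", "r1", "r3"])
print(grouped, position_sorted)
```

The set of finished names grows with the number of reads, so for a large file a check like this would probably be run on a sample of records.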

Be careful to choose proper parameters for your aligner - in general, if reads from contig A align to contig B, then Vamb will bin A and B together. So your aligner should map reads with the same level of discrimination that you want Vamb to use. Although you can use any aligner that produces a specification-compliant BAM file, we prefer using minimap2:

minimap2 -T almeida.fna -t 28 -N 5 -ax sr almeida.mmi sample1.forward.fastq.gz sample1.reverse.fastq.gz | samtools view -F 3584 -b --threads 8 > sample1.bam

⚠️ Important: Do not filter the alignments for mapping quality as specified by the MAPQ field of the BAM file. This field gives the probability that the mapping position is correct, which is influenced by the number of alternative mapping locations. Filtering low MAPQ alignments away removes alignments to homologous sequences, which biases the depth estimation.

If you are using BAM files where you do not trust the validity of every alignment in the file, you can filter the alignments for minimum nucleotide identity using the -z flag (uses the NM optional field of the alignment; we recommend setting it to 0.95), and/or filter for minimum alignment score using the -s flag (uses the AS optional field of the alignment).
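
To give a feel for what these two filters do, here is a rough sketch. It assumes identity is approximated as 1 - NM / (number of aligned columns in the CIGAR string), which is a common convention but not necessarily the exact formula Vamb applies. The function names and the tuple-based CIGAR representation are made up for this example.

```python
def alignment_identity(cigar, edit_distance):
    """Approximate identity as 1 - NM / aligned columns.

    cigar is a list of (operation, length) pairs using the SAM CIGAR letters.
    """
    aligned_columns = sum(
        length for op, length in cigar if op in ("M", "=", "X", "I", "D")
    )
    if aligned_columns == 0:
        return 0.0
    return 1.0 - edit_distance / aligned_columns


def passes_filters(cigar, edit_distance, score, min_identity=0.95, min_score=30):
    """Mimic a -z style identity filter and a -s style score filter."""
    return alignment_identity(cigar, edit_distance) >= min_identity and score >= min_score


good = passes_filters([("M", 100)], edit_distance=3, score=45)
too_divergent = passes_filters([("M", 90), ("I", 2), ("M", 8)], edit_distance=8, score=45)
low_score = passes_filters([("M", 100)], edit_distance=0, score=12)
print(good, too_divergent, low_score)
```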

We have found that MetaBAT2's jgi_summarize_bam_contig_depths program estimates BAM depths more accurately than Vamb's parsebam module. For the best results, we recommend downloading MetaBAT2, using jgi_summarize_bam_contig_depths to estimate depths, and then running Vamb with --jgi instead of --bamfiles. Also consider using the snakemake workflow, which will do this for you.

5) Run Vamb

By default, Vamb does not output any FASTA files of the bins. In the examples below, the option --minfasta 200000 is set, meaning that all bins with a size of 200 kbp or more will be output as FASTA files. If you trust the alignments in your BAM files, use:

vamb -o SEP --outdir OUT --fasta FASTA --bamfiles BAM1 BAM2 [...] --minfasta 200000

Here, SEP is the {Separator} chosen in step 3 (C in that example), OUT is the name of the output directory to create, FASTA is the path to the FASTA file, and BAM1 is the path to the first BAM file. You can also use shell globbing to input multiple BAM files: my_bamdir/*bam.

If you don't trust your alignments, set the -z and -s flags as appropriate, depending on the properties of your aligner. For example, if I used the aligner BWA MEM, I would use:

vamb -o SEP -z 0.95 -s 30 --outdir OUT --fasta FASTA --bamfiles BAM1 BAM2 [...] --minfasta 200000
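
The effect of --minfasta 200000 can be pictured as a size threshold on each bin. The sketch below assumes bin size means the summed length of the contigs in the bin, as the description of the option suggests. The function name and the example data are hypothetical.

```python
def bins_to_write(clusters, lengths, min_bin_size=200_000):
    """Return the names of bins whose total sequence length reaches min_bin_size."""
    selected = []
    for name, contigs in clusters.items():
        size = sum(lengths[contig] for contig in contigs)
        if size >= min_bin_size:
            selected.append(name)
    return selected


lengths = {"S1C1": 150_000, "S1C2": 60_000, "S2C1": 90_000}
clusters = {"1": ["S1C1", "S1C2"], "2": ["S2C1"]}
selected = bins_to_write(clusters, lengths)
print(selected)  # bin "1" totals 210 kbp, bin "2" only 90 kbp
```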

Parameter optimisation (optional)

The default hyperparameters of Vamb will generally provide good performance. However, since running Vamb is fast (especially using GPUs), it is possible to run Vamb with different hyperparameters to see if better performance can be achieved (here we measure performance as the number of near-complete bins assessed by CheckM). We recommend trying both a larger and a smaller neural network: we have used Vamb on datasets where increasing the network size resulted in more near-complete bins, and on other datasets where decreasing it did. To do this, you can run Vamb as follows (the default for multiple samples is -l 32 -n 512 512):

vamb -l 24 -n 384 384 --outdir path/to/outdir --fasta /path/to/catalogue.fna.gz --bamfiles /path/to/bam/*.bam -o C --minfasta 200000
vamb -l 40 -n 768 768 --outdir path/to/outdir --fasta /path/to/catalogue.fna.gz --bamfiles /path/to/bam/*.bam -o C --minfasta 200000

It is possible to try any combination of latent and hidden neurons as well as other layer sizes. The number of near-complete bins can be assessed using CheckM and compared between the runs. See also the workflow folder for an automated way to run Vamb with multiple parameters.
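
A small script can generate the commands for such a sweep. The sketch below only builds and prints the command lines, using the flags shown above. The helper name, the per-run output directory names and the single BAM path are assumptions made for this example.

```python
import shlex


def vamb_command(latent, hidden, outdir, fasta, bamfiles, separator="C", minfasta=200000):
    """Build the argument list for one Vamb run."""
    return [
        "vamb",
        "-l", str(latent),
        "-n", *[str(h) for h in hidden],
        "--outdir", outdir,
        "--fasta", fasta,
        "--bamfiles", *bamfiles,
        "-o", separator,
        "--minfasta", str(minfasta),
    ]


# (latent size, hidden layer sizes): smaller, default and larger networks.
settings = [(24, (384, 384)), (32, (512, 512)), (40, (768, 768))]
commands = [
    vamb_command(
        latent,
        hidden,
        f"path/to/outdir_l{latent}",  # hypothetical: one output directory per run
        "/path/to/catalogue.fna.gz",
        ["/path/to/bam/sample1.bam"],
    )
    for latent, hidden in settings
]
for command in commands:
    print(shlex.join(command))
```

Each run is given its own output directory here, since the output directory is described above as one that Vamb creates. The printed commands could then be run by hand or passed to subprocess.run.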
