Accurate identification of bacteriophages from metagenomic data using Transformer

Related tags

Deep LearningPhaMer
Overview

PhaMer

PhaMer is a python library for identifying bacteriophages from metagenomic data. PhaMer is based on a Transorfer model and rely on protein-based vocabulary to convert DNA sequences into sentences.

Overview

The main function of PhaMer is to identify phage-like contigs from metagenomic data. The input of the program should be fasta files and the output will be a csv file showing the predictions. Since it is a Deep learning model, if you have GPU units on your PC, we recommand you to use them to save your time.

If you have any trouble installing or using PhaMer, please let us know by opening an issue on GitHub or emailing us ([email protected]).

Required Dependencies

If you want to use the gpu to accelerate the program:

  • cuda

  • Pytorch-gpu

  • For cpu version pytorch: conda install pytorch torchvision torchaudio cpuonly -c pytorch

  • For gpu version pytorch: Search pytorch to find the correct cuda version according to your computer

An easiler way to install

Note: we suggest you to install all the package using conda (both miniconda and Anaconda are ok).

After cloning this respository, you can use anaconda to install the PhaMer.yaml. This will install all packages you need with gpu mode (make sure you have installed cuda on your system to use the gpu version. Othervise, it will run with cpu version). The command is: conda env create -f PhaMer.yaml -n phamer

Prepare the database and environment

Due to the limited size of the GitHub, we zip the database. Before using PhaMer, you need to unpack them using the following commands.

  1. When you use PhaMer at the first time
cd PhaMer/
conda env create -f PhaMer.yaml -n phamer
conda activate phamer
cd database/
bzip2 -d database.fa.bz2
git lfs install
rm transformer.pth
git checkout .
cd ..

Note: Because the parameter is larger than 100M, please make sure you have installed git-lfs to downloaded it from GitHub

  1. If the example can be run without any but bugs, you only need to activate your 'phamer' environment before using PhaMer.
conda activate phamer

Usage

python preprocessing.py [--contigs INPUT_FA] [--len MINIMUM_LEN]
python PhaMer.py [--out OUTPUT_CSV] [--reject THRESHOLD]

Options

  --contigs INPUT_FA
                        input fasta file
  --len MINIMUM_LEN
                        predict only for sequence >= len bp (default 3000)
  --out OUTPUT_CSV
                        The output csv file (prediction)
  --reject THRESHOLD
                        Threshold to reject prophage. The higher the value, the more prophage will be rejected (default 0.3)

Example

Prediction on the example file:

python preprocessing.py --contigs test_contigs.fa
python PhaMer.py --out example_prediction.csv

The prediction will be written in example_prediction.csv. The CSV file has three columns: contigs names, prediction, and prediction score.

References

The paper is submitted to the ISMB 2022.

The arXiv version can be found via: Accurate identification of bacteriophages from metagenomic data using Transformer

Contact

If you have any questions, please email us: [email protected]

Comments
  • issues collections from schackartk (solved)

    issues collections from schackartk (solved)

    Hi! Thank you for publishing your code publicly.

    I am a researcher who works with many tools that identify phage in metagenomes. However, I like to be confident in the implementation of the concepts. I noticed that your repository does not have any formal testing. Without tests, I am always skeptical about implementing a tool in my own work because I cannot be sure it is working as described.

    Would your team be interested in adding tests to the code (e.g. using pytest)? If I, or another developer, were to create a pull request that implemented testing, would your team consider accepting such a request?

    Also, it is a small thing, but I noticed that your code is not formatted in any community-accepted way. Would you consider accepting a pull request that has passed the code through a linter such as yapf or black? I usually add linting as part of my test suites.

    opened by schackartk 12
  • Rename preprocessing.py?

    Rename preprocessing.py?

    Hi Kenneth,

    preprocessing.py is a pretty generic name; maybe rename the script to PhaMer_preprocess.py to avoid potential future conflicts with other software?

    opened by sjaenick 1
  • bioconda recipe

    bioconda recipe

    Any plans on creating a bioconda recipe for PhaMer? That would greatly help users with the install & version management of PhaMer.

    Also in regards to:

    Because the parameter is larger than 100M, please make sure you have downloaded transformer.pth correctly.

    Why not just use md5sum?

    opened by nick-youngblut 1
  • Threading/Performance updates

    Threading/Performance updates

    Hi,

    • introduce ---threads to control threading behavior
    • allow to supply external database directory, so DIAMOND database formatting isn't needed every time
    • use 'pprodigal' for faster gene prediction step
    • removed unused imports

    Please note I didn't yet add pprodigal to the conda yaml - feel free to do so if you want to include it

    opened by sjaenick 1
  • Bug: Unable to clone repository

    Bug: Unable to clone repository

    Hello,

    It seems that this repository is exceeding its data transfer limits. I believe you are aware of this, as you instruct users to download the transformer.pth from Google Drive.

    However, it seems that I cannot clone the repository in general. I just want to make sure this is not a problem on my end, so I will walk through what I am doing.

    Reproducible Example

    First, cloning the repository

    $ git clone [email protected]:KennthShang/PhaMer.git
    Cloning into 'PhaMer'...
    Downloading database/transformer.pth (143 MB)
    Error downloading object: database/transformer.pth (28a82c1): Smudge error: Error downloading database/transformer.pth (28a82c1ca0fb2499c0071c685dbf49f3a0d060fdc231bb04f7535e88e7fe0858): batch response: This repository is over its data quota. Account responsible for LFS bandwidth should purchase more data packs to restore access.
    
    Errors logged to /xdisk/bhurwitz/mig2020/rsgrps/bhurwitz/schackartk/projects/PhaMer/.git/lfs/logs/20220124T113548.16324098.log
    Use `git lfs logs last` to view the log.
    error: external filter git-lfs smudge -- %f failed 2
    error: external filter git-lfs smudge -- %f failed
    fatal: database/transformer.pth: smudge filter lfs failed
    warning: Clone succeeded, but checkout failed.
    You can inspect what was checked out with 'git status'
    and retry the checkout with 'git checkout -f HEAD'
    

    Checking git status as suggested indicates that several files were not checked out

    $ cd PhaMer/
    $ git status
    # On branch main
    # Changes to be committed:
    #   (use "git reset HEAD <file>..." to unstage)
    #
    #	deleted:    .gitattributes
    #	deleted:    LICENSE.txt
    #	deleted:    PhaMer.py
    #	deleted:    PhaMer.yaml
    #	deleted:    README.md
    #	deleted:    database/.DS_Store
    #	deleted:    database/contigs.csv
    #	deleted:    database/database.fa.bz2
    #	deleted:    database/pc2wordsid.dict
    #	deleted:    database/pcs.csv
    #	deleted:    database/profiles.csv
    #	deleted:    database/proteins.csv
    #	deleted:    database/transformer.pth
    #	deleted:    logo.jpg
    #	deleted:    model.py
    #	deleted:    preprocessing.py
    #	deleted:    test_contigs.fa
    #
    # Untracked files:
    #   (use "git add <file>..." to include in what will be committed)
    #
    #	.gitattributes
    #	LICENSE.txt
    #	PhaMer.py
    #	PhaMer.yaml
    #	README.md
    #	database/
    

    To confirm that several files are missing, such as preprocessing.py.

    $ ls
    database  LICENSE.txt  PhaMer.py  PhaMer.yaml  README.md
    

    Continuing with installation instructions anyway in case they resolve these issues.

    $ conda env create -f PhaMer.yaml -n phamer
    Collecting package metadata (repodata.json): done
    Solving environment: done
    Preparing transaction: done
    Verifying transaction: done
    Executing transaction: \ By downloading and using the CUDA Toolkit conda packages, you accept the terms and conditions of the CUDA End User License Agreement (EULA): https://docs.nvidia.com/cuda/eula/index.html
    
    done
    Installing pip dependencies: / Ran pip subprocess with arguments:
    ['/home/u29/schackartk/.conda/envs/phamer/bin/python', '-m', 'pip', 'install', '-U', '-r', '/xdisk/bhurwitz/mig2020/rsgrps/bhurwitz/schackartk/projects/PhaMer/condaenv.1nhq2jy1.requirements.txt']
    Pip subprocess output:
    Collecting joblib==1.1.0
      Using cached joblib-1.1.0-py2.py3-none-any.whl (306 kB)
    Collecting scikit-learn==1.0.1
      Using cached scikit_learn-1.0.1-cp38-cp38-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (25.9 MB)
    Collecting sklearn==0.0
      Using cached sklearn-0.0-py2.py3-none-any.whl
    Collecting threadpoolctl==3.0.0
      Using cached threadpoolctl-3.0.0-py3-none-any.whl (14 kB)
    Requirement already satisfied: scipy>=1.1.0 in /home/u29/schackartk/.conda/envs/phamer/lib/python3.8/site-packages (from scikit-learn==1.0.1->-r /xdisk/bhurwitz/mig2020/rsgrps/bhurwitz/schackartk/projects/PhaMer/condaenv.1nhq2jy1.requirements.txt (line 2)) (1.7.1)
    Requirement already satisfied: numpy>=1.14.6 in /home/u29/schackartk/.conda/envs/phamer/lib/python3.8/site-packages (from scikit-learn==1.0.1->-r /xdisk/bhurwitz/mig2020/rsgrps/bhurwitz/schackartk/projects/PhaMer/condaenv.1nhq2jy1.requirements.txt (line 2)) (1.21.2)
    Installing collected packages: threadpoolctl, joblib, scikit-learn, sklearn
    Successfully installed joblib-1.1.0 scikit-learn-1.0.1 sklearn-0.0 threadpoolctl-3.0.0
    
    done
    #
    # To activate this environment, use
    #
    #     $ conda activate phamer
    #
    # To deactivate an active environment, use
    #
    #     $ conda deactivate
    

    Activating the conda environment and attempting to get transformer.pth

    $ conda activate phamer
    $ cd database
    $ bzip2 -d database.fa.bz2
    $ git lfs install
    Updated git hooks.
    Git LFS initialized.
    $ rm transformer.pth
    rm: cannot remove ‘transformer.pth’: No such file or directory
    $ git checkout .
    error: pathspec './' did not match any file(s) known to git.
    
    $ ls
    contigs.csv  database.fa  pc2wordsid.dict  pcs.csv  profiles.csv  proteins.csv
    

    Since the file doesn't seem to exist, I followed your Google Drive link and pasted into database/ manually.

    $ ls
    contigs.csv  database.fa  pc2wordsid.dict  pcs.csv  profiles.csv  proteins.csv  transformer.pth
    

    I will try checking out again.

    $ git checkout .
    error: pathspec './' did not match any file(s) known to git.
    

    Going back up, you can see that I am still missing scripts.

    $ cd ..
    $ ls
    metaphinder_reprex  phage_finders  PhaMer  snakemake_tutorial
    

    Conclusions

    I am missing the scripts and cannot run the tool. I believe this all comes down to the repo exceedingits data transfer limits. This is probably due to you storing the large database files in the repository.

    Possible solution?

    Going forward, maybe entirely remove the large files from the repo so that you don't exceed limits. I am not sure what I can do at this moment since the limits are already exceeded.

    Also, it would be helpful if I could obtain the transformer.pth from the command line (e.g. using wget) since I, and many researchers, are working on an HPC or cloud.

    Thank you, -Ken

    opened by schackartk 1
Releases(v1.0)
Owner
Kenneth Shang
Kenneth Shang
Official implementation of "Learning Not to Reconstruct" (BMVC 2021)

Official PyTorch implementation of "Learning Not to Reconstruct Anomalies" This is the implementation of the paper "Learning Not to Reconstruct Anomal

Marcella Astrid 13 Dec 04, 2022
⚓ Eurybia monitor model drift over time and securize model deployment with data validation

View Demo · Documentation · Medium article 🔍 Overview Eurybia is a Python library which aims to help in : Detecting data drift and model drift Valida

MAIF 172 Dec 27, 2022
Pretraining on Dynamic Graph Neural Networks

Pretraining on Dynamic Graph Neural Networks Our article is PT-DGNN and the code is modified based on GPT-GNN Requirements python 3.6 Ubuntu 18.04.5 L

7 Dec 17, 2022
Algorithms for outlier, adversarial and drift detection

Alibi Detect is an open source Python library focused on outlier, adversarial and drift detection. The package aims to cover both online and offline d

Seldon 1.6k Dec 31, 2022
Efficient and Scalable Physics-Informed Deep Learning and Scientific Machine Learning on top of Tensorflow for multi-worker distributed computing

Notice: Support for Python 3.6 will be dropped in v.0.2.1, please plan accordingly! Efficient and Scalable Physics-Informed Deep Learning Collocation-

tensordiffeq 74 Dec 09, 2022
This repository contains the code and models necessary to replicate the results of paper: How to Robustify Black-Box ML Models? A Zeroth-Order Optimization Perspective

Black-Box-Defense This repository contains the code and models necessary to replicate the results of our recent paper: How to Robustify Black-Box ML M

OPTML Group 2 Oct 05, 2022
State-of-the-art data augmentation search algorithms in PyTorch

MuarAugment Description MuarAugment is a package providing the easiest way to a state-of-the-art data augmentation pipeline. How to use You can instal

43 Dec 12, 2022
Pytorch for Segmentation

Pytorch for Semantic Segmentation This repo has been deprecated currently and I will not maintain it. Meanwhile, I strongly recommend you can refer to

ycszen 411 Nov 22, 2022
一个多模态内容理解算法框架,其中包含数据处理、预训练模型、常见模型以及模型加速等模块。

Overview 架构设计 插件介绍 安装使用 框架简介 方便使用,支持多模态,多任务的统一训练框架 能力列表: bert + 分类任务 自定义任务训练(插件注册) 框架设计 框架采用分层的思想组织模型训练流程。 DATA 层负责读取用户数据,根据 field 管理数据。 Parser 层负责转换原

Tencent 265 Dec 22, 2022
This is the official pytorch implementation for the paper: Instance Similarity Learning for Unsupervised Feature Representation.

ISL This is the official pytorch implementation for the paper: Instance Similarity Learning for Unsupervised Feature Representation, which is accepted

19 May 04, 2022
DexterRedTool - Dexter's Red Team Tool that creates cronjob/task scheduler to consistently creates users

DexterRedTool Author: Dexter Delandro CSEC 473 - Spring 2022 This tool persisten

2 Feb 16, 2022
SSD: A Unified Framework for Self-Supervised Outlier Detection [ICLR 2021]

SSD: A Unified Framework for Self-Supervised Outlier Detection [ICLR 2021] Pdf: https://openreview.net/forum?id=v5gjXpmR8J Code for our ICLR 2021 pape

Princeton INSPIRE Research Group 113 Nov 27, 2022
Hierarchical Clustering: O(1)-Approximation for Well-Clustered Graphs

Hierarchical Clustering: O(1)-Approximation for Well-Clustered Graphs This repository contains code to accompany the paper "Hierarchical Clustering: O

3 Sep 25, 2022
Sequential Model-based Algorithm Configuration

SMAC v3 Project Copyright (C) 2016-2018 AutoML Group Attention: This package is a reimplementation of the original SMAC tool (see reference below). Ho

AutoML-Freiburg-Hannover 778 Jan 05, 2023
A simple API wrapper for Discord interactions.

Your ultimate Discord interactions library for discord.py. About | Installation | Examples | Discord | PyPI About What is discord-py-interactions? dis

james 641 Jan 03, 2023
Learning Open-World Object Proposals without Learning to Classify

Learning Open-World Object Proposals without Learning to Classify Pytorch implementation for "Learning Open-World Object Proposals without Learning to

Dahun Kim 149 Dec 22, 2022
Pytorch ImageNet1k Loader with Bounding Boxes.

ImageNet 1K Bounding Boxes For some experiments, you might wanna pass only the background of imagenet images vs passing only the foreground. Here, I'v

Amin Ghiasi 11 Oct 15, 2022
The codebase for our paper "Generative Occupancy Fields for 3D Surface-Aware Image Synthesis" (NeurIPS 2021)

Generative Occupancy Fields for 3D Surface-Aware Image Synthesis (NeurIPS 2021) Project Page | Paper Xudong Xu, Xingang Pan, Dahua Lin and Bo Dai GOF

xuxudong 97 Nov 10, 2022
A system used to detect whether a person is wearing a medical mask or not.

Mask_Detection_System A system used to detect whether a person is wearing a medical mask or not. To open the program, please follow these steps: Make

Mohamed Emad 0 Nov 17, 2022
Mip-NeRF: A Multiscale Representation for Anti-Aliasing Neural Radiance Fields.

This repository contains the code release for Mip-NeRF: A Multiscale Representation for Anti-Aliasing Neural Radiance Fields. This implementation is written in JAX, and is a fork of Google's JaxNeRF

Google 625 Dec 30, 2022