ONT Analysis Toolkit (OAT)

Overview

Code style: black

                                       ,d
                                       88
              ,adPPYba,  ,adPPYYba, MM88MMM
              a8"     "8a ""     `Y8   88   
              8b       d8 ,adPPPPP88   88   
              "8a,   ,a8" 88,    ,88   88,  
              `"YbbdP"'  `"8bbdP"Y8   "Y888

ONT Analysis Toolkit (OAT)

A pipeline to facilitate sequencing of viral genomes (amplified with a tiled amplicon scheme) and assembly into consensus genomes. Supported viruses currently include Human betaherpesvirus 5 (CMV) and SARS-CoV-2. ONT sequencing data from a MinION can be monitored in real time with the rampart module. All analysis steps are handled by the analysis module, whereby sequencing data are analysed using a pipeline written in snakemake, with the choice of tools heavily influenced by the artic minion pipeline. Both steps can be run in order with a single command using the all module.

Author: Dr Charles Foster

Starting out

To begin with, clone this github repository:

git clone https://github.com/charlesfoster/ont-analysis-toolkit.git

cd ont-analysis-toolkit

Next, install most dependencies using conda:

conda env create -f environment.yml

Pro tip: if you install mamba, you can create the environment with that command instead of conda. A lot of conda headaches go away: it's a much faster drop-in replacement for conda.

conda install mamba
mamba env create -f environment.yml

Install the pipeline correctly by activating the conda environment and using the setup.py script with:

conda activate oat
pip install .

Other dependencies:

  • Demultiplexing is done using guppy_barcoder. The program will need to be installed and in your path.
  • If analysing SARS-CoV-2 data, lineages are typed using pangolin. Accordingly, pangolin needs to be installed according to instructions at https://github.com/cov-lineages/pangolin. The pipeline will fail if the pangolin environment can't be activated.
  • Variants are called using medaka and longshot. Ideally we could install these via mamba in the main environment.yml file, but there are sadly some necessary libraries for medaka that are incompatible with our main oat environment. Consequently, I have written the analysis pipeline so that snakemake automatically installs medaka and its dependencies into an isolated environment during execution of the oat pipeline. The environment is only created the first time you run the pipeline, but if you run the pipeline from a different working directory in the future, the environment will be created again. Solution: always run oat from the same working directory

tl;dr: you don't need to do anything for variant calling to work; just don't get confused during the initial pipeline run when the terminal indicates creation of a new conda environment. You should run oat from the same directory each time, otherwise a new conda environment will be created each time.

Usage

The environment with all necessary tools is installed as 'oat' for brevity. The environment should first be activated:

conda activate oat

Then, to run the pipeline, it's as simple as:

oat 
   

   

where should be replaced with the full path to a spreadsheet with minimal metadata for the ONT sequencing run (see example spreadsheet: run_data_example.csv). The most important thing to remember is that the 'run_name' in the spreadsheet must exactly match the name of the sequencing run, as set up in MinKNOW.

Note that there are many additional options/settings to take advantage of:

A pipeline for sequencing and analysis of viral genomes using an ONT MinION positional arguments: samples_file Path to file with sample metadata (.csv format). See example spreadsheet for minimum necessary information. optional arguments: -h, --help show this help message and exit -b , --barcode_kit Barcode kit that you used: '12' (SQK-RBK004) or '96' (SQK-RBK110-96) (default: 12) -c , --consensus_freq Variant allele frequency threshold for a variant to be incorporated into consensus genome. Variants below this frequency will be incorporated with an IUPAC ambiguity. Default: 0.8 -d, --demultiplex Demultiplex reads using guppy_barcoder. By default, assumes reads were already demultiplexed by MinKNOW. Reads are demultiplexed into the output directory. -f, --force Force overwriting of completed files in snakemake analysis (Default: files not overwritten) -n, --dry_run Dry run only -m rampart | analysis | all, --module rampart | analysis | all Pipeline module to run: 'rampart', 'analysis' or 'all' (rampart followed by analysis). Default: 'all' -o OUTDIR, --outdir OUTDIR Output directory. Default: /path/to/ont- analysis-toolkit/analysis_results + 'run_name' from samples spreadsheet --rampart_outdir RAMPART_OUTDIR Output directory. Default: /path/to/ont- analysis-toolkit/rampart_files -p, --print_dag Save directed acyclic graph (DAG) of workflow to outdir -r REFERENCE, --reference REFERENCE Reference genome to use: 'MN908947.3' (SARS-CoV-2), 'NC_006273.2' (CMV Merlin). Other references can be used, but the corresponding assembly (fasta) and annotation (gff3 from Ensembl) must be added to /home/cfos/miniconda3/envs/oat/lib/python3.9/site- packages/oat/references (Default: MN908947.3) -t , --threads Number of threads to use -v VARIANT_CALLER, --variant_caller VARIANT_CALLER Variant caller to use. Choices: 'medaka-longshot'. Default: 'medaka-longshot' --create_envs_only Create conda environments for snakemake analysis, but do no further analysis. Useful for initial pipeline setup. Default: False --snv_min SNV_MIN Minimum variant allele frequency for an SNV to be kept Default: 0.2 --delete_reads Delete demultiplexed reads after analysis --redo_analysis Delete entire analysis output directory and contents for a fresh run --version show program's version number and exit --minknow_data MINKNOW_DATA Location of MinKNOW data root. Default: /var/lib/minknow/data --max_memory Maximum memory (in MB) that you would like to provide to snakemake. Default: 53456MB --quiet Stop printing of snakemake commands to screen. --report Generate report (currently non-functional).">

                                       ,d
                                       88
              ,adPPYba,  ,adPPYYba, MM88MMM
              a8"     "8a ""     `Y8   88   
              8b       d8 ,adPPPPP88   88   
              "8a,   ,a8" 88,    ,88   88,  
              `"YbbdP"'  `"8bbdP"Y8   "Y888

        OAT: ONT Analysis Toolkit (version 0.2.0)

usage: oat [options] 
          
           

A pipeline for sequencing and analysis of viral genomes using an ONT MinION

positional arguments:
  samples_file          Path to file with sample metadata (.csv format). See
                        example spreadsheet for minimum necessary information.

optional arguments:
  -h, --help            show this help message and exit
  -b 
           
            , --barcode_kit 
            
             
                        Barcode kit that you used: '12' (SQK-RBK004) or '96'
                        (SQK-RBK110-96) (default: 12)
  -c 
             
              , --consensus_freq 
              
                Variant allele frequency threshold for a variant to be incorporated into consensus genome. Variants below this frequency will be incorporated with an IUPAC ambiguity. Default: 0.8 -d, --demultiplex Demultiplex reads using guppy_barcoder. By default, assumes reads were already demultiplexed by MinKNOW. Reads are demultiplexed into the output directory. -f, --force Force overwriting of completed files in snakemake analysis (Default: files not overwritten) -n, --dry_run Dry run only -m rampart | analysis | all, --module rampart | analysis | all Pipeline module to run: 'rampart', 'analysis' or 'all' (rampart followed by analysis). Default: 'all' -o OUTDIR, --outdir OUTDIR Output directory. Default: /path/to/ont- analysis-toolkit/analysis_results + 'run_name' from samples spreadsheet --rampart_outdir RAMPART_OUTDIR Output directory. Default: /path/to/ont- analysis-toolkit/rampart_files -p, --print_dag Save directed acyclic graph (DAG) of workflow to outdir -r REFERENCE, --reference REFERENCE Reference genome to use: 'MN908947.3' (SARS-CoV-2), 'NC_006273.2' (CMV Merlin). Other references can be used, but the corresponding assembly (fasta) and annotation (gff3 from Ensembl) must be added to /home/cfos/miniconda3/envs/oat/lib/python3.9/site- packages/oat/references (Default: MN908947.3) -t 
               
                , --threads 
                
                  Number of threads to use -v VARIANT_CALLER, --variant_caller VARIANT_CALLER Variant caller to use. Choices: 'medaka-longshot'. Default: 'medaka-longshot' --create_envs_only Create conda environments for snakemake analysis, but do no further analysis. Useful for initial pipeline setup. Default: False --snv_min SNV_MIN Minimum variant allele frequency for an SNV to be kept Default: 0.2 --delete_reads Delete demultiplexed reads after analysis --redo_analysis Delete entire analysis output directory and contents for a fresh run --version show program's version number and exit --minknow_data MINKNOW_DATA Location of MinKNOW data root. Default: /var/lib/minknow/data --max_memory 
                 
                   Maximum memory (in MB) that you would like to provide to snakemake. Default: 53456MB --quiet Stop printing of snakemake commands to screen. --report Generate report (currently non-functional). 
                 
                
               
              
             
            
           
          

What does the pipeline do?

RAMPART Module

All input files for RAMPART are generated based on your input spreadsheet, and a web browser is launched to view the sequencing in real time.

Analysis Module

Reads are mapped to the relevant reference genome with minimap2. Amplicon primers are trimmed using samtools ampliconclip. Variants are called using medaka and longshot, followed by filtering and consensus genome assembly using bcftools. The amino acid consequences of SNPs are inferred using bcftools csq. If analysing SARS-CoV-2, lineages are inferred using pangolin. Finally, a variety of sample QC metrics are combined into a final QC file.

Other Notes

Protocols

A protocol for the amplicon scheme needs (a) to be installed in the pipeline, and (b) named in the run_data.csv spreadsheet for analyses to work correctly. The pipeline comes with the Midnight protocol for SARS-CoV-2 pre-installed (https://www.protocols.io/view/sars-cov2-genome-sequencing-protocol-1200bp-amplic-bwyppfvn). Adding additional protocols is fairly easy:

  1. Make a directory called /path/to/ont-analysis-toolkit/oat/protocols/ARTICV3 (needs to be in all caps)

  2. Make a directory within ARTICV3 called 'rampart'

    (a) Put the normal rampart files within that directory (genome.json, primers.json, protocol.json, references.fasta)

  3. Make a directory within ARTICV3 called 'schemes'

    (a) Put the 'scheme.bed' file with primer coordinates in the 'schemes' directory

  4. Make sure you're in /path/to/ont-analysis-toolkit/, then activate the conda environment and use the following command: pip install .

Done!

Amino acid consequences

For the amino acid consequences step to work, a requirement is an annotation file for the chosen reference genome. The annotations must be in gff3 format, and must be in the 'Ensembl flavour' of gff3. There is a script included in the repository that can convert an NCBI gff3 file into an 'Ensembl flavour' gff3 file: /path/to/ont-analysis-toolkit/oat/scripts/gff2gff.py.

Credits

  • When this pipeline is used, citations should be found for the programs used internally.
  • The gff3 file I included for SARS-CoV-2 was originally sent to me by Torsten Seemann.
  • Being new to using snakemake + wrapper scripts, I used pangolin as a guide for directory structure and rule creation - so thanks to them.
  • The analysis module was heavily influenced by the ARTIC team, especially the artic minion pipeline.
  • gff2gff.py is based on work by Damien Farrell https://dmnfarrell.github.io/bioinformatics/bcftools-csq-gff-format
You might also like...
Red Team Toolkit is an Open-Source Django Offensive Web-App which is keeping the useful offensive tools used in the red-teaming together.
Red Team Toolkit is an Open-Source Django Offensive Web-App which is keeping the useful offensive tools used in the red-teaming together.

RedTeam Toolkit Note: Only legal activities should be conducted with this project. Red Team Toolkit is an Open-Source Django Offensive Web-App contain

A Static Analysis Tool for Detecting Security Vulnerabilities in Python Web Applications
A Static Analysis Tool for Detecting Security Vulnerabilities in Python Web Applications

This project is no longer maintained March 2020 Update: Please go see the amazing Pysa tutorial that should get you up to speed finding security vulne

SpiderFoot automates OSINT collection so that you can focus on analysis.
SpiderFoot automates OSINT collection so that you can focus on analysis.

SpiderFoot is an open source intelligence (OSINT) automation tool. It integrates with just about every data source available and utilises a range of m

Yuyu Scanner is a Web Reconnaissance & Web Analysis Scanner to find assets and information about targets.
Yuyu Scanner is a Web Reconnaissance & Web Analysis Scanner to find assets and information about targets.

Yuyu Scanner Yuyu Scanner is a Web Reconnaissance & Web Analysis Scanner to find assets and information about targets. installation ! run as root

ThePhish: an automated phishing email analysis tool
ThePhish: an automated phishing email analysis tool

ThePhish ThePhish is an automated phishing email analysis tool based on TheHive, Cortex and MISP. It is a web application written in Python 3 and base

Lazarus analysis tools and research report
Lazarus analysis tools and research report

Lazarus Research This repository publishes analysis reports and analysis tools for Operation Dream Job and Operation JTrack for Lazarus. Tools Python

IDA scripts for hypervisor (Hyper-v) analysis and reverse engineering automation
IDA scripts for hypervisor (Hyper-v) analysis and reverse engineering automation

Re-Scripts IA32-VMX-Helper (IDA-Script) IA32-MSR-Decoder (IDA-Script) IA32 VMX Helper It's an IDA script (Updated IA32 MSR Decoder) which helps you to

Android Malware (Analysis | Scoring) System
Android Malware (Analysis | Scoring) System

An Obfuscation-Neglect Android Malware Scoring System Quark-Engine is also bundled with Kali Linux, BlackArch. A trust-worthy, practical tool that's r

Malware arcane - Scripts and notes on my malware analysis journey

Malware Arcane Repository of notes and scripts I use when doing malware analysis

Releases(v0.10.1)
  • v0.10.1(Apr 27, 2022)

    While the 'oat' pipeline has undergone continuous development since its inception, this is the first designated release (well, pre-release). The pipeline should be fully functional, but please let me know if you encounter any errors or bugs.

    The pipeline has the most incorporated features for SARS-CoV-2, but works well with other viruses if you install them properly as per the instructions in the README.md file.

    Source code(tar.gz)
    Source code(zip)
Searches filesystem for CVE-2021-44228 and CVE-2021-45046 vulnerable instances of log4j library, including embedded (jar/war/zip) packaged ones.

log4shell_finder Python port of https://github.com/mergebase/log4j-detector log4j-detector is copyright (c) 2021 - MergeBase Software Inc. https://mer

Hynek Petrak 33 Jan 04, 2023
This is an advanced backdoor, created with Python

Backdoor This is a Backdoor, created with Python 3. Types of Commands: Downloading / Uploading files. Launching / Deleting / Reading file's content. S

swagkarna 28 Oct 28, 2022
A small utility to deal with malware embedded hashes.

Uchihash is a small utility that can save malware analysts the time of dealing with embedded hash values used for various things such as: Dyn

Abdallah Elshinbary 48 Dec 19, 2022
We protect the privacy of the data on your computer by using the camera of your Debian based Pardus operating system. 🕵️

Pardus Lookout We protect the privacy of the data on your computer by using the camera of your Debian based Pardus operating system. The application i

Ahmet Furkan DEMIR 19 Nov 18, 2022
Python program that generates secure passwords.

Python program that generates secure passwords. The user has the option to select the length of the password, amount of passwords,

4 Dec 07, 2021
How to exploit a double free vulnerability in 2021. 'Use-After-Free for Dummies'

This bug doesn’t exist on x86: Exploiting an ARM-only race condition How to exploit a double free and get a shell. "Use-After-Free for dummies" In thi

Stephen Tong 1.2k Dec 25, 2022
Script Crack Facebook Yang Kaya Akan Teh Hijau 🚶‍♂

r-mbf Script Crack Facebook 🚶‍♂ Bukti Recode [•] Install Script $ pkg update && pkg upgrade $ pkg install python $ pkg install git $ pip install requ

O'Hayo Smrn 3 Apr 02, 2022
Worm/Trojan/Ransomware/apt/Rootkit/Virus Database

Pestilence - The Malware Database [] Screenshot Pestilence is a project created to make the possibility of malware analysis open and available to the

*ERR0R* 47 Dec 21, 2022
Receive notifications/alerts on the most recent disclosed CVE's.

Receive notifications on the most recent disclosed CVE's.

Ameliorate 7 Nov 24, 2022
DNSSEQ: PowerDNS with FALCON Signature Scheme

PowerDNS-based proof-of-concept implementation of DNSSEC using the post-quantum FALCON signature scheme.

Nils Wisiol 4 Feb 03, 2022
Simple tool to create passwords.

PasswordGenerator Simple password generator: -Simplisitc Window Application -Allows Numbers, Symbols & letters upper and lowercase -Restricts rows of

DM 1 Jan 10, 2022
All in One CRACKER911181's Tool. This Tool For Hacking and Pentesting.🎭

This is A Python & Bash Programming Based Termux-Tool Created By CRACKER911181. This Tool Created For Hacking and Pentesting. If You Use This Tool To Evil Purpose,The Owner Will Never be Responsible

CRACKER911181 1 Jan 10, 2022
HatSploit collection of generic payloads designed to provide a wide range of attacks without having to spend time writing new ones.

HatSploit collection of generic payloads designed to provide a wide range of attacks without having to spend time writing new ones.

EntySec 5 May 10, 2022
Send CVE information to the specified mailbox (from Github)

Send CVE information to the specified mailbox (from Github)

91 Nov 08, 2022
Spring-0day/CVE-2022-22965

CVE-2022-22965 Spring Framework/CVE-2022-22965 Vulnerability ID: CVE-2022-22965/CNVD-2022-23942/QVD-2022-1691 Reproduce the vulnerability docker pull

iak 4 Apr 05, 2022
Exploiting CVE-2021-44228 in VMWare Horizon for remote code execution and more.

Log4jHorizon Exploiting CVE-2021-44228 in VMWare Horizon for remote code execution and more. BLOG COMING SOON Code and README.md this time around are

96 Dec 14, 2022
logmap: Log4j2 jndi injection fuzz tool

logmap - Log4j2 jndi injection fuzz tool Used for fuzzing to test whether there are log4j2 jndi injection vulnerabilities in header/body/path Use http

之乎者也 67 Oct 25, 2022
集成crawlergo、xray、dirsearch、nmap等工具的src漏洞挖掘工具,使用docker封装运行;

tools下有几个工具,所以项目文件比较大,如果下载总是中断的话建议拆开下载各个项目然后直接拷贝dockefile和recon.py即可 0x01 hscan介绍 hscan是什么 hscan是一款旨在使用一条命令替代渗透前的多条扫描命令,通过集成crawlergo扫描和xray扫描、dirsear

102 Jan 04, 2023
Getting my gitlab commit history into github

🔰 ᵀᴱᴸᴱᴳᴿᴬᴹ ᴴᴬᶜᴷ ᴮᴼᵀ 🔰 The owner would not be responsible for any kind of bans due to the bot. • ⚡ INSTALLING ⚡ • • 🛠️ Lᴀɴɢᴜᴀɢᴇs Aɴᴅ Tᴏᴏʟs 🔰 • If

Santiago Chiesa 1 Dec 24, 2021