Find transposable element (TE) insertions from long reads (Nanopore) by direct alignment with minimap2.

Last update: Feb 09, 2022

find_te_ins

find_te_ins finds transposable element (TE) insertions from long reads (Nanopore) by direct alignment with minimap2.

Install

$ git clone https://github.com/bakerwm/find_te_ins.git
$ cd find_te_ins

Set the variables genome_fa and te_fa on lines 10 and 11 of run_pipe.sh to match your setup, then run the script:

$ bash run_pipe.sh

Prerequisites

minimap2 - 2.17-r974-dirty, align long reads to reference genome
featureCounts - v2.0.0, quantification
samtools - v1.12, working with BAM files
python 3.8+
pysam 0.16.0.1, python module, working with BAM files

Getting Started

1 Prepare input files

genome_fa - reference genome in FASTA format, set on line 10 of run_pipe.sh
te_fa - TE consensus sequences in FASTA format, set on line 11 of run_pipe.sh
long reads - long reads from Nanopore or PacBio, in FASTA or FASTQ format

2 Run pipe

$ cd ~/work/te_ins
# specify the path of long reads data: <path-to-long-reads>/
$ git clone https://github.com/bakerwm/find_te_ins.git 
$ bash find_te_ins/run_pipe.sh <path-to-long-reads>/ results

[1/9] align to reference genome
[2/9] extract raw insertions from BAM, by CIGAR
[3/9] convert raw insertions to fasta format
[4/9] align raw_insertion to transposon
[5/9] extract transposon name for insertions
[6/9] merge raw_insertions by window=100
[7/9] count reads for each insertion
[8/9] save final insertions to file
[9/9] Done!
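
Step 6 merges nearby raw insertions with a window of 100. The merging code itself is not shown in this README, so the following is only a minimal sketch of one way such a merge could work: sites on the same chromosome are chained into one cluster while each site lies within `window` bp of the previous one. The function name `merge_by_window` and the `(chrom, pos)` input layout are assumptions made for illustration, not part of find_te_ins.

```python
def merge_by_window(insertions, window=100):
    """Group (chrom, pos) insertion sites lying within `window` bp of each other.

    Hypothetical helper: returns a list of (chrom, start, end, n_sites) clusters.
    """
    clusters = []
    for chrom, pos in sorted(insertions):
        last = clusters[-1] if clusters else None
        if last and last[0] == chrom and pos - last[2] <= window:
            last[2] = pos      # extend the current cluster to this site
            last[3] += 1       # one more raw insertion in the cluster
        else:
            clusters.append([chrom, pos, pos, 1])
    return [tuple(c) for c in clusters]


# Made-up coordinates, for illustration only.
sites = [("chr2L", 1000), ("chr2L", 1040), ("chr2L", 1120),
         ("chr2L", 5000), ("chr3R", 1010)]
for cluster in merge_by_window(sites):
    print(cluster)
```

Chaining on the previous site (rather than on the cluster start) lets a cluster grow beyond 100 bp in total; whether the pipeline behaves this way is not stated here.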

3 Output

The pipeline produces the files listed below. The final TE insertions are saved in *.te_ins.final.bed

$ tree -L 2 results/ONT_sample-1
.
├── ONT_sample-1
│   ├── ONT_sample-1.bam
│   ├── ONT_sample-1.bam.bai
│   ├── ONT_sample-1.raw_ins.bed
│   ├── ONT_sample-1.raw_ins.fa
│   ├── ONT_sample-1.raw_ins.fa.bam
│   ├── ONT_sample-1.raw_ins.fa.bam.bai
│   ├── ONT_sample-1.te_ins.bed
│   ├── ONT_sample-1.te_ins.final.bed
│   ├── ONT_sample-1.te_ins.final.bed6
│   ├── ONT_sample-1.te_ins.gtf
│   ├── ONT_sample-1.te_ins.quant.stderr
│   ├── ONT_sample-1.te_ins.quant.stdout
│   ├── ONT_sample-1.te_ins.quant.txt
│   ├── ONT_sample-1.te_ins.quant.txt.summary
│   ├── ONT_sample-1.te_ins.raw.txt
│   ├── run_minimap2.dm6.stderr
│   └── run_minimap2.dm6_transposon.stderr
...

{sample_name}.te_ins.final.bed

column 1. chromosome name in the reference
column 2. start position of the insertion
column 3. end position of the insertion
column 4. insertion name
column 5. a fixed integer [255]
column 6. strand # the current version does not consider the direction of TE insertions
column 7. name of the TE consensus
column 8. length of the TE consensus
column 9. proportion of the TE consensus identified
column 10. number of reads supporting the insertion
column 11. total number of reads covering the insertion
column 12. proportion of reads supporting the TE insertion
column 13. type of the TE insertion [full, p3, p5]
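
The 13 columns above can be read into a typed record for downstream filtering. This is a minimal sketch that assumes the file is tab-separated and that the proportions are written as decimal numbers; neither is confirmed here. The field names and the example row are invented for illustration.

```python
from typing import NamedTuple


class TEInsertion(NamedTuple):
    # Field names are hypothetical; the order follows the column list above.
    chrom: str
    start: int
    end: int
    name: str
    score: int
    strand: str
    te_name: str
    te_length: int
    te_fraction: float
    support_reads: int
    total_reads: int
    support_fraction: float
    ins_type: str


def parse_te_ins_line(line):
    """Convert one line of a .te_ins.final.bed file into a TEInsertion."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != len(TEInsertion._fields):
        raise ValueError(f"expected 13 columns, got {len(fields)}")
    casts = TEInsertion.__annotations__.values()  # str, int, float per field
    return TEInsertion(*(cast(value) for cast, value in zip(casts, fields)))


# Invented example row, not real pipeline output.
row = "chr2L\t1200\t1201\tins_1\t255\t+\tTE_example\t5000\t0.98\t12\t20\t0.6\tfull\n"
record = parse_te_ins_line(row)
print(record.te_name, record.support_reads, record.ins_type)
```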

{sample_name}.te_ins.raw.txt

column 16 (the last column) is the type of the TE insertion: [full, p3, p5]

full, more than the cutoff [60%] of the TE consensus was detected
p3, only the 3' end of the TE consensus was detected
p5, only the 5' end of the TE consensus was detected
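
The three types can be illustrated with a small classifier. The logic of anno_te.py is not reproduced in this README, so this is a sketch under assumptions: the hit is described by its aligned interval on the TE consensus, "full" means the covered fraction exceeds the cutoff, and a partial hit is labeled by whichever consensus end it sits closer to. The function name and arguments are hypothetical.

```python
def classify_te_hit(te_len, hit_start, hit_end, cutoff=0.6):
    """Label a hit on the TE consensus as 'full', 'p5' or 'p3'.

    hit_start and hit_end are 0-based, end-exclusive coordinates on the
    consensus. The strict '>' comparison and the tie-breaking rule for
    partial hits are assumptions, not taken from anno_te.py.
    """
    covered = (hit_end - hit_start) / te_len
    if covered > cutoff:
        return "full"
    gap5 = hit_start             # consensus bases missing at the 5' end
    gap3 = te_len - hit_end      # consensus bases missing at the 3' end
    return "p5" if gap5 <= gap3 else "p3"


print(classify_te_hit(5000, 0, 4000))      # 80% of the consensus covered
print(classify_te_hit(5000, 0, 1000))      # only the 5' end
print(classify_te_hit(5000, 4200, 5000))   # only the 3' end
```

Raising the cutoff makes the "full" class stricter: a hit covering 64% of the consensus is "full" at 0.6 but partial at 0.7.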

In the .final.bed file, only full TE insertions are saved for further analysis.

Change criteria

TE types are assigned by anno_te.py, which is called from run_pipe.sh. The cutoff -c (0.6) can be changed to any float between 0 and 1 to suit your data; see line 100 of run_pipe.sh.

# line-100 of run_pipe.sh
[[ ! -f ${te_ins_txt} ]] && python ${src_dir}/anno_te.py -x ${te_fa_fai} ${te_bam} | sort -k4,4 -k5,5n > ${te_ins_txt}

# change criteria to 0.7
[[ ! -f ${te_ins_txt} ]] && python ${src_dir}/anno_te.py -x ${te_fa_fai} -c 0.7 ${te_bam} | sort -k4,4 -k5,5n > ${te_ins_txt}

# remove te_ins files, and run the command again
$ rm results/ONT_sample-1.te_ins*
$ bash find_te_ins/run_pipe.sh <path-to-long-reads>/ results

How it works?

extract INSERTIONS
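
Step 2 of the pipeline extracts raw insertions from the BAM file by CIGAR. As a rough illustration of that idea (not the repository's own code), the sketch below walks a CIGAR string and reports every `I` operation of at least `min_len` bases together with its reference position. It uses plain strings rather than pysam so it runs without a BAM file; the function name and the `min_len=50` cutoff are assumptions.

```python
import re

# CIGAR operations that advance the position on the reference (SAM spec).
REF_CONSUMING = set("MDN=X")
CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")


def find_insertions(ref_start, cigar, min_len=50):
    """Return (ref_pos, ins_len) for each insertion of at least min_len bases.

    ref_start is the 0-based leftmost reference position of the alignment.
    min_len is a hypothetical cutoff, not a value taken from the pipeline.
    """
    pos = ref_start
    hits = []
    for length, op in CIGAR_RE.findall(cigar):
        length = int(length)
        if op == "I" and length >= min_len:
            hits.append((pos, length))   # insertion sits between pos-1 and pos
        if op in REF_CONSUMING:
            pos += length
    return hits


# A read aligned at position 1000 carrying one large and one tiny insertion.
print(find_insertions(1000, "200M5000I300M2I100M"))  # [(1200, 5000)]
```

Soft clips (`S`) and insertions do not advance the reference position, which is why only `M`, `D`, `N`, `=` and `X` move `pos`.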

Owner

Ming Wang
