SNV calling pipeline developed explicitly to process individual or trio vcf files obtained from Illumina based pipeline (grch37/grch38).

Overview

SNV Pipeline

SNV calling pipeline developed explicitly to process individual or trio vcf files obtained from Illumina based pipeline (grch37/grch38). The pipeline requires user defined datasets & annotation sources, available tools and input set of vcf files. It generates analysis scripts that can be incorporated into high performance cluster (HPC) computing to process the samples. This results in list of filtered variants per family that can be used for interpreation, reporting and further downstream analysis.

For demonstration purpose below example is presented for GRCh37. However, the same can be replicated for GRCh38.

Installation

git clone https://github.com/ajaarma/snv.git

Required Installation packages

Install anaconda v2.0
Follow this link for installation: https://docs.anaconda.com/anaconda/install/linux/
Conda environment commands
$ conda create --name snv
$ source activate snv
$ conda install python=2.7.16
$ pip install xmltodict
$ pip install dicttoxml

$ conda install -c bioconda gvcfgenotyper
$ conda install -c anaconda gawk	
$ conda install samtools=1.3
$ conda install vcftools=0.1.14
$ conda install bcftools=1.9
$ conda install gcc #(OSX)
$ conda install gcc_linux-64 #(Linux)
$ conda install parallel
$ conda install -c mvdbeek ucsc_tools
** conda-develop -n 
    
    
     /demo/softwares/vep/Plugins/

$ conda install -c r r-optparse
$ conda install -c r r-dplyr
$ conda install -c r r-plyr
$ conda install -c r r-data.table
$ conda install -c aakumar r-readbulk
$ conda install -c bioconda ensembl-vep=100.4
$ vep_install -a cf -s homo_sapiens -y GRCh37 -c 
     
      /demo/softwares/vep/grch37 --CONVERT
$ vep_install -a cf -s homo_sapiens -y GRCh38 -c 
      
       /demo/softwares/vep/grch38 --CONVERT

      
     
    
   

Data directory and datasets

Default datasets provided
1. exac_pli: demo/resources/gnomad/grch37/gnomad.v2.1.1.lof_metrics.by_transcript_forVEP.txt
2. ensembl: demo/resources/ensembl/grch37/ensBioMart_grch37_v98_ENST_lengths_191208.txt
3. region-exons: demo/resources/regions/grch37/hg19_refseq_ensembl_exons_50bp_allMT_hgmd_clinvar_20200519.txt
4. region-pseudo-autosomal: demo/resources/regions/grch37/hg19_non_pseudoautosomal_regions_X.txt
5. HPO: demo/resources/hpo/phenotype_to_genes.tar.gz
Other datasets that require no entry to user-configuration file
6. Curated: 
	6.1. Genelist: demo/resources/curated/NGC_genelist_allNamesOnly-20200519.txt
	6.2. Somatic mosaicism genes: demo/resources/curated/haem_somatic_mosaicism_genes_20191015.txt
	6.3. Imprinted gene list: demo/resources/curated/imprinted_genes_20200424.txt
	6.4. Polymorphic gene list: demo/resources/curated/polymorphic_genes_20200509.txt
7. OMIM: demo/resources/omim/omim_20200421_geneInfoBase.txt

Download link for following dataset and place them in corresponding directories as shown

' | awk -v OFS="\t" '{ if(/^#/){ print }else{ print $1,$2,$3,$4,$5,$6,$7,"ID="$3";"$8 } }' | bgzip -c > hgmd_pro_2019.4_hg19_wID.vcf.gz $ bcftools index -t hgmd_pro_2019.4_hg19_wID.vcf.gz $ bcftools index hgmd_pro_2019.4_hg19_wID.vcf.gz Put it in this directory: demo/resources/hgmd/grch37/hgmd_pro_2019.4_hg19_wID.vcf.gz Edit the user config flat file CONFIG/UserConfig.txt : hgmd=hgmd/grch37/hgmd_pro_2019.4_hg19_wID.vcf.gz 8. CLINVAR: Download link: https://ftp.ncbi.nlm.nih.gov/pub/clinvar/vcf_GRCh37/weekly/clinvar_20200506.vcf.gz https://ftp.ncbi.nlm.nih.gov/pub/clinvar/vcf_GRCh37/weekly/clinvar_20200506.vcf.gz.tbi Put it in this directory: demo/resources/clinvar/grch37/clinvar_20200506.vcf.gz Edit the user config flat file CONFIG/UserConfig.txt : clinvar=clinvar/grch37/clinvar_20200506.vcf.gz ">
1. HPO: Extract HPO phenotypes mapping:
	$ cd 
   
    /demo/resources/hpo/
	$ tar -zxvf phenotypes_to_genes.tar.gz 

2. REFERENCE SEQUENCE GENOME (FASTA file alongwith Index)
	Download link: https://drive.google.com/drive/folders/1Ro3pEYhVdYkMmteSr8YRPFeTvb_K0lVf?usp=sharing
	Download file: Homo_sapiens.GRCh37.74.dna.fasta
		Get the corresponding index and dict files: *.fai and *.dict
	Put this in folder: demo/resources/genomes/grch37/Homo_sapiens.GRCh37.74.dna.fasta

3. GNOMAD
	Download link (use wget): 
	Genomes: https://storage.googleapis.com/gnomad-public/release/2.1.1/vcf/genomes/gnomad.genomes.r2.1.1.sites.vcf.bgz
	Exomes: https://storage.googleapis.com/gnomad-public/release/2.1.1/vcf/exomes/gnomad.exomes.r2.1.1.sites.vcf.bgz
	Put it in this folder: 
		demo/resources/gnomad/grch37/gnomad.genomes.r2.1.1.sites.vcf.bgz
		demo/resources/gnomad/grch37/gnomad.exomes.r2.1.1.sites.vcf.bgz
	Edit User config flat file CONFIG/UserConfig.txt : 
		gnomad_g=gnomad/grch37/gnomad.genomes.r2.1.1.sites.vcf.bgz
		gnomad_e=gnomad/grch37/gnomad.exomes.r2.1.1.sites.vcf.bgz

4. ExAC:
	Download Link: https://drive.google.com/drive/folders/11Ya8XfAxOYmlKZ9mN8A16IDTLHdHba_0?usp=sharing
	Download file: ExAC.r0.3.1.sites.vep.decompose.norm.prefixed_PASS-only.vcf.gz
		also the index files (*.csi and *.tbi)
	Put it in this folder as: 
		demo/resources/exac/grch37/ExAC.r0.3.1.sites.vep.decompose.norm.prefixed_PASS-only.vcf.gz
	Edit User config flat file CONFIG/UserConfig.txt : 
		exac=exac/grch37/ExAC.r0.3.1.sites.vep.decompose.norm.prefixed_PASS-only.vcf.gz
		exac_t=exac/grch37/ExAC.r0.3.1.sites.vep.decompose.norm.prefixed_PASS-only.vcf.gz

5. CADD:
	Download link (use wget):
		https://krishna.gs.washington.edu/download/CADD/v1.6/GRCh37/whole_genome_SNVs.tsv.gz
		https://krishna.gs.washington.edu/download/CADD/v1.6/GRCh37/InDels.tsv.gz
		(Also download the corresponding tabix index files as well)
	Put it in this directory: 
		demo/resources/cadd/grch37/whole_genome_SNVs.tsv.gz
		demo/resource/cadd/grch37/InDels.tsv.gz
	Edit the user config flat file CONFIG/UserConfig.txt :
		cadd_snv=cadd/grch37/whole_genome_SNVs.tsv.gz
		cadd_indel=cadd/grch37/InDels.tsv.gz

6. REVEL:
	Download link: https://drive.google.com/drive/folders/12Tl1YU5bI-By_VawTPVWHef7AXzn4LuP?usp=sharing
	Download file: new_tabbed_revel.tsv.gz
	         Also the index file: *.tbi
	Put it in this directory: demo/resources/revel/grch37/new_tabbed_revel.tsv.gz
	Edit the user config flat file CONFIG/UserConfig.txt : 
		revel=revel/grch37/new_tabbed_revel.tsv.gz

7. HGMD:
	Download link: http://www.hgmd.cf.ac.uk/ac/index.php (Require personal access login)
	Put it in this directory: demo/resources/hgmd/grch37/hgmd_pro_2019.4_hg19.vcf

	Use this command to process HGMD file inside this directory:
		$ cat hgmd_pro_2019.4_hg19.vcf | sed '/##comment=.*\"/a  ##INFO=
    
     ' | awk -v OFS="\t" '{ if(/^#/){ print }else{ print $1,$2,$3,$4,$5,$6,$7,"ID="$3";"$8 } }' | bgzip -c  > hgmd_pro_2019.4_hg19_wID.vcf.gz
		$ bcftools index -t hgmd_pro_2019.4_hg19_wID.vcf.gz
		$ bcftools index hgmd_pro_2019.4_hg19_wID.vcf.gz	

	Put it in this directory: demo/resources/hgmd/grch37/hgmd_pro_2019.4_hg19_wID.vcf.gz
	Edit the user config flat file CONFIG/UserConfig.txt :
		hgmd=hgmd/grch37/hgmd_pro_2019.4_hg19_wID.vcf.gz

8. CLINVAR:
	Download link: 
		https://ftp.ncbi.nlm.nih.gov/pub/clinvar/vcf_GRCh37/weekly/clinvar_20200506.vcf.gz
		https://ftp.ncbi.nlm.nih.gov/pub/clinvar/vcf_GRCh37/weekly/clinvar_20200506.vcf.gz.tbi
	Put it in this directory: demo/resources/clinvar/grch37/clinvar_20200506.vcf.gz
	Edit the user config flat file CONFIG/UserConfig.txt :
		clinvar=clinvar/grch37/clinvar_20200506.vcf.gz

    
   
Customized Curated Annotation sets
Default present with this distribution. Can be found in XML file with these tags:
	(1) GeneList: 
   
    
	(2) Somatic mosaicism genes: 
    
     
	(3) Imprinted genes: 
     
      
	(4) Polymorphic genes: 
      
       
	(5) HPO terms: 
       
         (6) OMIM: 
         
        
       
      
     
    
   

Activate the conda environment

$ source activate snv

Step - 1:

1. Edit CONFIG/UserConfig.txt: 
	(a) Add the absolute path prefix for the resources directory with tag: resourceDir. 
	    An example can be seen in CONFIG/Example-UserConfig.txt file.
	(b) Manually check if datasets corresponding to other field tags are correctly downloaded and 
	    put in respective folders.
2. Create user defined XML file from input User Configuration flat file and Base-XML file
Command:
$ python createAnalysisXML.py -u 
   
     
		    	      -b 
    
      
		              -o 
     

     
    
   
Example:
$ python createAnalysisXML.py -u CONFIG/UserConfig.txt 
		       	      -b CONFIG/Analysis_base_grch37.xml 
		              -o CONFIG/Analysis_user_grch37.xml
Outputs:
CONFIG/Analysis_user_grch37.xml

Step-2:

1. Put the respective vcf files in the directory. For example: demo/example/vcf/ 
2. Create manifest file in same format as shown in demo/example/example_manifest.txt
3. Assign gender to each family members (illumina or sample id). For example: demo/example/example_genders.txt
4. List of all the family ids that needed to be analyzed.
         For e.g: demo/example/manifest/example_family_analysis.txt

Step -3:

Generate all the shell scripts that can be incorporated into user specific HPC cluster network. For e.g: Slurm/PBS/LSF network.

Command:
$ python processSNV.py 	-a 
   
    
	    		-p 
    
     
	      		-m 
     
      
	     		-e 
      
       
			-w 
       
         -g 
        
          -d 
         
           -f 
          
            -s 
           
             -r 
             
            
           
          
         
        
       
      
     
    
   
Example:
$ python processSNV.py 	-a CONFIG/Analysis_user_grch37.xml \ 
			-p 20210326 \
			-m 
   
    /demo/example/example_manifest.txt \
			-e gvcfGT \
			-w 
    
     /demo/example/ \
			-g 
     
      /demo/example/example_genders.txt \
			-d 
      
       /demo/example/exeter_samples_norm.fof \
			-f 
       
        /demo/example/manifest/example_family_analysis.txt \ -s 
        
         /demo/example/manifest/example_family.fof \ -r 
         
          /demo/example/manifest/example_family_header.txt (optional) 
         
        
       
      
     
    
   
Outputs:
Two scripts in the directory: 
   
    /demo/example/20210326/tmp_binaries/
Launch the scripts in these 2 stages sequentially after each of them gets finished.

   (1) genotypeAndAnnotate_chr%.sh where %=1..22,X,Y and MT
	scatter the annotation and frequency filtering per chromosome for all families.
   (2) mergeAndFilter.sh:
	Merge all the chromosome and apply inheritance filtering.

   

Step-4

Final output of list of filtered variant is present in:

   
    /demo/example/20210326/fam_filter/
    
     /
     
      .filt_
      
       .txt

      
     
    
   
For any questions/issues/bugs please mail us at:
Owner
East Genomics
Bringing together genomic medicine across the East Midlands and East of England
East Genomics
A forecasting system dedicated to smart city data

smart-city-predictions System prognostyczny dedykowany dla danych inteligentnych miast Praca inżynierska realizowana przez Michała Stawikowskiego and

Kevin Lai 1 Nov 08, 2021
Tools for the analysis, simulation, and presentation of Lorentz TEM data.

ltempy ltempy is a set of tools for Lorentz TEM data analysis, simulation, and presentation. Features Single Image Transport of Intensity Equation (SI

McMorran Lab 1 Dec 26, 2022
PandaPy has the speed of NumPy and the usability of Pandas 10x to 50x faster (by @firmai)

PandaPy "I came across PandaPy last week and have already used it in my current project. It is a fascinating Python library with a lot of potential to

Derek Snow 527 Jan 02, 2023
PyEmits, a python package for easy manipulation in time-series data.

PyEmits, a python package for easy manipulation in time-series data. Time-series data is very common in real life. Engineering FSI industry (Financial

Thompson 5 Sep 23, 2022
Spectral Analysis in Python

SPECTRUM : Spectral Analysis in Python contributions: Please join https://github.com/cokelaer/spectrum contributors: https://github.com/cokelaer/spect

Thomas Cokelaer 280 Dec 16, 2022
An orchestration platform for the development, production, and observation of data assets.

Dagster An orchestration platform for the development, production, and observation of data assets. Dagster lets you define jobs in terms of the data f

Dagster 6.2k Jan 08, 2023
Implementation in Python of the reliability measures such as Omega.

reliabiliPy Summary Simple implementation in Python of the [reliability](https://en.wikipedia.org/wiki/Reliability_(statistics) measures for surveys:

Rafael Valero Fernández 2 Apr 27, 2022
CaterApp is a cross platform, remotely data sharing tool created for sharing files in a quick and secured manner.

CaterApp is a cross platform, remotely data sharing tool created for sharing files in a quick and secured manner. It is aimed to integrate this tool with several more features including providing a U

Ravi Prakash 3 Jun 27, 2021
Senator Trades Monitor

Senator Trades Monitor This monitor will grab the most recent trades by senators and send them as a webhook to discord. Installation To use the monito

Yousaf Cheema 5 Jun 11, 2022
Galvanalyser is a system for automatically storing data generated by battery cycling machines in a database

Galvanalyser is a system for automatically storing data generated by battery cycling machines in a database, using a set of "harvesters", whose job it

Battery Intelligence Lab 20 Sep 28, 2022
Mining the Stack Overflow Developer Survey

Mining the Stack Overflow Developer Survey A prototype data mining application to compare the accuracy of decision tree and random forest regression m

1 Nov 16, 2021
NFCDS Workshop Beginners Guide Bioinformatics Data Analysis

Genomics Workshop FIXME: overview of workshop Code of Conduct All participants s

Elizabeth Brooks 2 Jun 13, 2022
Hydrogen (or other pure gas phase species) depressurization calculations

HydDown Hydrogen (or other pure gas phase species) depressurization calculations This code is published under an MIT license. Install as simple as: pi

Anders Andreasen 13 Nov 26, 2022
Analytical view of olist e-commerce in Brazil

Analysis of E-Commerce Public Dataset by Olist The objective of this project is to propose an analytical view of olist e-commerce in Brazil. For this

Gurpreet Singh 1 Jan 11, 2022
Minimal working example of data acquisition with nidaqmx python API

Data Aquisition using NI-DAQmx python API Based on this project It is a minimal working example for data acquisition using the NI-DAQmx python API. It

Pablo 1 Nov 05, 2021
Used for data processing in machine learning, and help us to construct ML model more easily from scratch

Used for data processing in machine learning, and help us to construct ML model more easily from scratch. Can be used in linear model, logistic regression model, and decision tree.

ShawnWang 0 Jul 05, 2022
Learn machine learning the fun way, with Oracle and RedBull Racing

Red Bull Racing Analytics Hands-On Labs Introduction Are you interested in learning machine learning (ML)? How about doing this in the context of the

Oracle DevRel 55 Oct 24, 2022
University Challenge 2021 With Python

University Challenge 2021 This repository contains: The TeX file of the technical write-up describing the University / HYPER Challenge 2021 under late

2 Nov 27, 2021
A DSL for data-driven computational pipelines

"Dataflow variables are spectacularly expressive in concurrent programming" Henri E. Bal , Jennifer G. Steiner , Andrew S. Tanenbaum Quick overview Ne

1.9k Jan 03, 2023
Python package for processing UC module spectral data.

UC Module Python Package How To Install clone repo. cd UC-module pip install . How to Use uc.module.UC(measurment=str, dark=str, reference=str, heade

Nicolai Haaber Junge 1 Oct 20, 2021