Hg002-qc-snakemake - HG002 QC Snakemake

Overview

HG002 QC Snakemake

To Run

Resources and data are specified within the snakefile (hg002QC.smk) for simplicity. Tested with snakemake v6.15.3.

Warning: Several steps of this workflow require a minimum coverage. It is recommended that this workflow not be run when the yield in base pairs is insufficient to produce at least 15X coverage (i.e., run it only when yield/3099922541 >= 15).
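The coverage check in the warning above can be sketched in Python as follows. This is only an illustration: the function names and the example yield are hypothetical and are not part of the workflow; the genome size is the divisor quoted in the warning.

```python
# Genome size in base pairs, taken from the divisor used in the warning above.
GENOME_SIZE_BP = 3_099_922_541
# Minimum recommended coverage from the warning above.
MIN_COVERAGE = 15.0


def estimated_coverage(yield_bp, genome_size_bp=GENOME_SIZE_BP):
    """Estimate fold coverage as total yield divided by genome size."""
    return yield_bp / genome_size_bp


def has_sufficient_coverage(yield_bp, min_coverage=MIN_COVERAGE):
    """Return True when the estimated coverage meets the recommended minimum."""
    return estimated_coverage(yield_bp) >= min_coverage


# Hypothetical example: a run that yielded 50 Gb of sequence.
example_yield_bp = 50_000_000_000
print(f"{estimated_coverage(example_yield_bp):.1f}X",
      has_sufficient_coverage(example_yield_bp))
```

A check like this could be run on the total yield of a condition before submitting the job.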

# clone repo
git clone --recursive https://github.com/PacificBiosciences/pb-human-wgs-workflow-snakemake.git workflow

# make necessary directories
mkdir cluster_logs

# create conda environment
conda env create --file workflow/environment.yaml

# activate conda environment
conda activate pb-human-wgs-workflow

# submit job
sbatch workflow/run_hg002QC.sh

Plots

A list of important stats from target files that would be good for plotting.

targets = [f"conditions/{condition}/{filename}"
                    for condition in ubam_dict.keys()
                    for filename in ["smrtcell_stats/all_movies.read_length_and_quality.tsv",
                                    "hifiasm/asm.p_ctg.fasta.stats.txt",
                                    "hifiasm/asm.a_ctg.fasta.stats.txt",
                                    "hifiasm/asm.p_ctg.qv.txt",
                                    "hifiasm/asm.a_ctg.qv.txt",
                                    "truvari/summary.txt",
                                    "pbsv/all_chroms.pbsv.vcf.gz",
                                    "deepvariant/deepvariant.vcf.stats.txt",
                                    "whatshap/deepvariant.phased.tsv",
                                    "happy/all.summary.csv",
                                    "happy/all.extended.csv",
                                    "happy/cmrg.summary.csv",
                                    "happy/cmrg.extended.csv",
                                    "mosdepth/coverage.mosdepth.summary.txt",
                                    "mosdepth/mosdepth.M2_ratio.txt",
                                    "mosdepth/gc_coverage.summary.txt",
                                    "mosdepth/coverage.thresholds.summary.txt"]]
  • smrtcell_stats/all_movies.read_length_and_quality.tsv
    • outputs 3 columns (read name, read length, read quality)
    • boxplots of read length and quality
  • hifiasm/asm.p_ctg.fasta.stats.txt (primary) + hifiasm/asm.a_ctg.fasta.stats.txt (alternate)
    • all stats below should be collected for both primary (p_ctg) and alternate (a_ctg) assemblies
    • assembly size awk '$1=="SZ" {print $2}' <filename>
    • auN (area under the curve) awk '$1=="AU" {print $2}' <filename>
    • NGx - line plot of NG10 through NG90 awk '$1=="NL" {print $2, $3}' <filename> ($2 is x-axis, $3 y-axis) like this: example plot
  • hifiasm/asm.p_ctg.qv.txt + hifiasm/asm.a_ctg.qv.txt
    • adjusted assembly quality awk '$1=="QV" {print $3}' <filename> for primary and alternate assemblies
  • truvari/summary.txt
    • structural variant recall jq .recall <filename>
    • structural variant precision jq .precision <filename>
    • structural variant f1 jq .f1 <filename>
    • number of calls jq '."call cnt"' <filename>
    • FP jq .FP <filename>
    • TP-call jq '."TP-call"' <filename>
    • FN jq .FN <filename>
    • TP-base jq '."TP-base"' <filename>
  • pbsv/all_chroms.pbsv.vcf.gz
    • counts of each type of variant bcftools query -i 'FILTER=="PASS"' -f '%INFO/SVTYPE\n' <filename> | awk '{A[$1]++}END{for(i in A)print i,A[i]}'
    • can also do size distributions of indels bcftools query -i 'FILTER=="PASS" && (INFO/SVTYPE=="INS" | INFO/SVTYPE=="DEL")' -f '%INFO/SVTYPE\t%INFO/SVLEN\n' <filename>
  • deepvariant/deepvariant.vcf.stats.txt
    • several values in lines starting with 'SN' awk '$1=="SN"' <filename>
      • number of SNPs
      • number of INDELs
      • number of multi-allelic sites
      • number of multi-allelic SNP sites
    • ratio of transitions to transversions awk '$1=="TSTV" {print$5}' <filename>
    • can monitor substitution types awk '$1=="ST"' <filename>
    • SNP heterozygous : non-ref homozygous ratio awk '$1=="PSC" {print $6/$5}' <filename>
    • SNP transitions : transversions awk '$1=="PSC" {print $7/$8}' <filename>
    • Number of heterozygous insertions : number of homozygous alt insertions awk '$1=="PSI" {print $8/$10}' <filename>
    • Number of heterozygous deletions : number of homozygous alt deletions awk '$1=="PSI" {print $9/$11}' <filename>
    • Total INDEL heterozygous : homozygous ratio awk '$1=="PSI" {print ($8+$9)/($10+$11)}' <filename>
  • whatshap/deepvariant.phased.tsv
    • phase block N50 awk '$2=="ALL" {print $22}' <filename>
    • bp_per_block_sum (total number of phased bases) awk '$2=="ALL" {print $18}' <filename>
  • whatshap/deepvariant.phased.blocklist
    • calculate phase block sizes (to - from) and sort them in descending order (awk 'NR>1 {print $5-$4}' <filename> | sort -nr), then plot as a cumulative line graph as for the assembly, N0 to N90: example plot
  • happy/all.summary.csv + happy/cmrg.summary.csv
    • stats should be collected for all variants and for CMRG (challenging medically relevant genes)
      • SNP recall awk -F, '$1=="SNP" && $2=="PASS" {print $10}' <filename>
      • SNP precision awk -F, '$1=="SNP" && $2=="PASS" {print $11}' <filename>
      • SNP F1 awk -F, '$1=="SNP" && $2=="PASS" {print $13}' <filename>
      • INDEL recall awk -F, '$1=="INDEL" && $2=="PASS" {print $10}' <filename>
      • INDEL precision awk -F, '$1=="INDEL" && $2=="PASS" {print $11}' <filename>
      • INDEL F1 awk -F, '$1=="INDEL" && $2=="PASS" {print $13}' <filename>
  • happy/all.extended.csv + happy/cmrg.extended.csv
    • there are many stratifications that can be examined, and Aaron Wenger might have opinions on which are most important. The commands below are for just one stratification, "GRCh38_lowmappabilityall.bed.gz".
    • SNP GRCh38_lowmappabilityall recall awk -F, '$1=="SNP" && $2=="*" && $3=="GRCh38_lowmappabilityall.bed.gz" && $4=="PASS" {print $8}' <filename>
    • SNP GRCh38_lowmappabilityall precision awk -F, '$1=="SNP" && $2=="*" && $3=="GRCh38_lowmappabilityall.bed.gz" && $4=="PASS" {print $9}' <filename>
    • SNP GRCh38_lowmappabilityall F1 awk -F, '$1=="SNP" && $2=="*" && $3=="GRCh38_lowmappabilityall.bed.gz" && $4=="PASS" {print $11}' <filename>
    • INDEL GRCh38_lowmappabilityall recall awk -F, '$1=="INDEL" && $2=="*" && $3=="GRCh38_lowmappabilityall.bed.gz" && $4=="PASS" {print $8}' <filename>
    • INDEL GRCh38_lowmappabilityall precision awk -F, '$1=="INDEL" && $2=="*" && $3=="GRCh38_lowmappabilityall.bed.gz" && $4=="PASS" {print $9}' <filename>
    • INDEL GRCh38_lowmappabilityall F1 awk -F, '$1=="INDEL" && $2=="*" && $3=="GRCh38_lowmappabilityall.bed.gz" && $4=="PASS" {print $11}' <filename>
  • mosdepth/coverage.mosdepth.summary.txt
    • mean aligned coverage in "coverage.mosdepth.summary.txt" - 4th column of final row, can grep 'total_region'
  • mosdepth/mosdepth.M2_ratio.txt
    • outputs single value: ratio of chr2 coverage to chrM coverage
    • bar chart of m2 ratio
  • mosdepth/gc_coverage.summary.txt
    • outputs 5 columns: gc percentage bin, q1 , median , q3 , count
    • q1, median, q3 columns are statistics for coverage at different gc percentages (e.g. median coverage at 30% GC)
    • "count" refers to # of 500 bp windows that fall in that bin
    • can pick a couple of key GC coverage bins and make box plots out of them
  • mosdepth/coverage.thresholds.summary.txt
    • outputs 10 columns corresponding to % of genome sequenced to minimum coverage depths (1X - 10X)
    • maybe a line chart comparing the different coverage thresholds among conditions
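The phase block calculation described in the list above (block size as to minus from, sorted in descending order, then read off cumulatively from N0 to N90) could be sketched in Python roughly as below. This is a minimal sketch, not the workflow's own code: the function names are hypothetical, the column positions are taken from the awk command in the list, and the example rows are made up for illustration.

```python
from itertools import accumulate


def block_lengths(lines):
    """Mirror awk 'NR>1 {print $5-$4}': skip the header line and return
    (to - from) for each phase block. Lines with fewer than 5 fields are skipped."""
    lengths = []
    for line_number, line in enumerate(lines, start=1):
        fields = line.split()
        if line_number == 1 or len(fields) < 5:
            continue
        lengths.append(int(fields[4]) - int(fields[3]))
    return lengths


def nx_curve(lengths, percents=range(0, 100, 10)):
    """Return {x: Nx}, where Nx is the length of the block at which the running
    sum of lengths, sorted in descending order, first reaches x% of the total."""
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        return {}
    total = sum(ordered)
    cumulative = list(accumulate(ordered))
    curve = {}
    for x in percents:
        threshold = total * x / 100
        for length, running in zip(ordered, cumulative):
            if running >= threshold:
                curve[x] = length
                break
    return curve


# Made-up rows in the column order assumed by the awk command ($4 = from, $5 = to).
example_lines = [
    "#sample\tchromosome\tphase_set\tfrom\tto\tvariants",
    "HG002\tchr1\t1000\t1000\t51000\t40",
    "HG002\tchr1\t60000\t60000\t90000\t25",
    "HG002\tchr2\t5000\t5000\t25000\t12",
]
curve = nx_curve(block_lengths(example_lines))
for x, length in curve.items():
    print(f"N{x}\t{length}")
```

For a real file, the open file handle can be passed directly to block_lengths, and the resulting x and Nx pairs plotted as the line graph.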
Owner
Juniper A. Lake
Bioinformatics Scientist