HG002 QC Snakemake
To Run
Resources and data specified within snakefile (hg002QC.smk) for simplicity. Tested with snakemake v6.15.3.
Warning: Several steps of this workflow require minimum coverage. It's recommended that this workflow not be run when yield in base pairs is insufficient to produceat least 15X coverage (i.e. yield/3099922541 >= 15x).
# clone repo
git clone --recursive https://github.com/PacificBiosciences/pb-human-wgs-workflow-snakemake.git workflow
# make necessary directories
mkdir cluster_logs
# create conda environment
conda env create --file workflow/environment.yaml
# activate conda environment
conda activate pb-human-wgs-workflow
# submit job
sbatch workflow/run_hg002QC.sh
Plots
A list of important stats from target files that would be good for plotting.
targets = [f"conditions/{condition}/{filename}"
                    for condition in ubam_dict.keys()
                    for filename in ["smrtcell_stats/all_movies.read_length_and_quality.tsv",
                                    "hifiasm/asm.p_ctg.fasta.stats.txt",
                                    "hifiasm/asm.a_ctg.fasta.stats.txt",
                                    "hifiasm/asm.p_ctg.qv.txt",
                                    "hifiasm/asm.a_ctg.qv.txt",
                                    "truvari/summary.txt",
                                    "pbsv/all_chroms.pbsv.vcf.gz",
                                    "deepvariant/deepvariant.vcf.stats.txt",
                                    "whatshap/deepvariant.phased.tsv",
                                    "happy/all.summary.csv",
                                    "happy/all.extended.csv",
                                    "happy/cmrg.summary.csv",
                                    "happy/cmrg.extended.csv",
                                    "mosdepth/coverage.mosdepth.summary.txt",
                                    "mosdepth/mosdepth.M2_ratio.txt",
                                    "mosdepth/gc_coverage.summary.txt",
                                    "mosdepth/coverage.thresholds.summary.txt"]]
- smrtcell_stats/all_movies.read_length_and_quality.tsv- outputs 3 columns (read name, read length, read quality)
- boxplots of read length and quality
 
- hifiasm/asm.p_ctg.fasta.stats.txt(primary) +- hifiasm/asm.a_ctg.fasta.stats.txt(alternate)- all stats below should be collected for both primary (p_ctg) and alternate (p_atg) assemblies
- assembly size awk '$1=="SZ" {print $2}' <filename>
- auN (area under the curve) awk '$1=="AU" {print $2}' <filename>
- NGx - line plot of NG10 through NG90 awk '$1=="NL" {print $2 $3}' <filename>($2 is x-axis, $3 y-axis) like this: example plot
 
- hifiasm/asm.p_ctg.qv.txt+- hifiasm/asm.a_ctg.qv.txt- adjusted assembly quality awk '$1=="QV" {print $3}' <filename>for primary and alternate assemblies
 
- adjusted assembly quality 
- truvari/truvari.summary.txt- structural variant recall jq .recall <filename>
- structural variant precision jq .precision <filename>
- structural variant f1 jq .f1 <filename>
- number of calls jq '."call cnt"' <filename>
- FP jq .FP <filename>
- TP-call jq .TP-call <filename>
- FN jq .FN <filename>
- TP-base jq .TP-base <filename>
 
- structural variant recall 
- pbsv/all_chroms.pbsv.vcf.gz- counts of each type of variant bcftools query -i 'FILTER=="PASS"' -f '%INFO/SVTYPE\n' <filename> | awk '{A[$1]++}END{for(i in A)print i,A[i]}'
- can also do size distributions of indels bcftools query -i 'FILTER=="PASS" && (INFO/SVTYPE=="INS" | INFO/SVTYPE=="DEL")' -f '%INFO/SVTYPE\t%INFO/SVLEN\n' <filename>
 
- counts of each type of variant 
- deepvariant/deepvariant.vcf.stats.txt- several values in lines starting with 'SN' awk '$1=="SN"' <filename>- number of SNPS
- number INDELs
- number of multi-allelic sites
- number of multi-allelic SNP sites
 
- ratio of transitions to transversions awk '$1=="TSTV" {print$5}' <filename>
- can monitor substitution types awk '$1=="ST"' <filename>
- SNP heterozygous : non-ref homozygous ratio awk '$1=="PSC" {print $6/$5}' <filename>
- SNP transitions : transversions awk '$1=="PSC" {print $7/$8}' <filename>
- Number of heterozygous insertions : number of homozgyous alt insertions awk '$1=="PSI" {print $8/$10}' <filename>
- Number of heterozygous deletions : number of homozgyous alt deletions awk '$1=="PSI" {print $9/$11}' <filename>
- Total INDEL heterozygous:homozygous ratio awk '$1=="PSI" {print ($8+$9)/($10+$11)}' <filename>8+9:10+11 indel het:hom)
 
- several values in lines starting with 'SN' 
- whatshap/deepvariant.phased.tsv- phase block N50 awk '$2=="ALL" {print $22}' <filename>
- bp_per_block_sum (total number of phased bases) awk '$2=="ALL" {print $18}' <filename>
 
- phase block N50 
- whatshap/deepvariant.phased.blocklist- calculate phase block size (to - from) and reverse order them (awk 'NR>1 {print $5-$4}' <filename> |sort -nr), then plot as cumulative line graph like for assembly, N_0 to N90 example plot
 
- calculate phase block size (to - from) and reverse order them (
- happy/all.summary.csv+- happy/cmrg.summary.csv- stats should be collected for allvariants andcmrgchallenging medically relevant genes- SNP recall awk -F, '$1=="SNP" && $2=="PASS" {print $10}' <filename>
- SNP precision awk -F, '$1=="SNP" && $2=="PASS" {print $11}' <filename>
- SNP F1 awk -F, '$1=="SNP" && $2=="PASS" {print $13}' <filename>
- INDEL recall awk -F, '$1=="INDEL" && $2=="PASS" {print $10}' <filename>
- INDEL precision awk -F, '$1=="INDEL" && $2=="PASS" {print $11}' <filename>
- INDEL F1 awk -F, '$1=="INDEL" && $2=="PASS" {print $13}' <filename>
 
- SNP recall 
 
- stats should be collected for 
- happy/all.extended.csv+- happy/cmrg.extended.csv- there are many stratifications that can be examined, and Aaron Wenger might have opinionso n which are most important. The below commands are just for one stratification "GRCh38_lowmappabilityall.bed.gz".
- SNP GRCh38_lowmappabilityall recall awk -F, '$1=="SNP" && $2=="*" && $3=="GRCh38_lowmappabilityall.bed.gz" && $4=="PASS" {print $8}' <filename>
- SNP GRCh38_lowmappabilityall precision awk -F, '$1=="SNP" && $2=="*" && $3=="GRCh38_lowmappabilityall.bed.gz" && $4=="PASS" {print $9}' <filename>
- SNP GRCh38_lowmappabilityall F1 awk -F, '$1=="SNP" && $2=="*" && $3=="GRCh38_lowmappabilityall.bed.gz" && $4=="PASS" {print $11}' <filename>
- INDEL GRCh38_lowmappabilityall recall awk -F, '$1=="INDEL" && $2=="*" && $3=="GRCh38_lowmappabilityall.bed.gz" && $4=="PASS" {print $8}' <filename>
- INDEL GRCh38_lowmappabilityall precision awk -F, '$1=="INDEL" && $2=="*" && $3=="GRCh38_lowmappabilityall.bed.gz" && $4=="PASS" {print $9}' <filename>
- INDEL GRCh38_lowmappabilityall F1 awk -F, '$1=="INDEL" && $2=="*" && $3=="GRCh38_lowmappabilityall.bed.gz" && $4=="PASS" {print $11}' <filename>
 
- mosdepth/coverage.mosdepth.summary.txt- mean aligned coverage in "coverage.mosdepth.summary.txt" - 4th column of final row, can grep 'total_region'
 
- mosdepth/mosdepth.M2_ratio.txt- outputs single value: ratio of chr2 coverage to chrM coverage
- bar chart of m2 ratio
 
- mosdepth/gc_coverage.summary.txt- outputs 5 columns: gc percentage bin, q1 , median , q3 , count
- q1, median, q3 columns are statistics for coverage at different gc percentages (e.g. median cover at 30% GC)
- "count" refers to # of 500 bp windows that fall in that bin
- can pick a couple of key GC coverage bins and make box plots out of them
 
- mosdepth/coverage.thresholds.summary.txt- outputs 10 columns corresponding to % of genome sequenced to minimum coverage depths (1X - 10X)
- maybe a line chart comparing the different coverage thresholds among conditions