Validated, scalable, community developed variant calling, RNA-seq and small RNA analysis

Overview

bcbio provides validated, scalable, community-developed variant calling, RNA-seq and small RNA analysis. You write a high-level configuration file specifying your inputs and analysis parameters. This input drives a parallel run that handles distributed execution, idempotent processing restarts and safe transactional steps. As a shared community resource, bcbio handles the data processing component of sequencing analysis, giving researchers more time to focus on the downstream biology.

Features

Quick start

  1. Install bcbio-nextgen with all tool dependencies and data files:

    wget https://raw.githubusercontent.com/bcbio/bcbio-nextgen/master/scripts/bcbio_nextgen_install.py
    python bcbio_nextgen_install.py /usr/local/share/bcbio --tooldir=/usr/local \
          --genomes hg38 --aligners bwa --aligners bowtie2

    producing an editable system configuration file referencing the installed software, data and system information.

  2. Automatically create a processing description from your project's sample FASTQ and BAM files and a CSV file of sample metadata:

    bcbio_nextgen.py -w template freebayes-variant project1.csv sample1.bam sample2_1.fq sample2_2.fq

    This produces a sample description file containing pipeline configuration options.

  3. Run analysis, distributed across 8 local cores:

    cd project1/work
    bcbio_nextgen.py ../config/project1.yaml -n 8
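
The CSV passed to the template command in step 2 can be written by hand or generated. The sketch below is a minimal illustration in Python, assuming the commonly used `samplename` and `description` columns plus optional `batch` and `phenotype` metadata. The sample values are invented, and the columns your project needs may differ.

```python
import csv
import io

# Assumed column layout; check the bcbio documentation for the columns
# your analysis actually needs.
FIELDS = ["samplename", "description", "batch", "phenotype"]

# Invented example rows matching the file names used in the quick start.
ROWS = [
    {"samplename": "sample1", "description": "patient1_normal",
     "batch": "batch1", "phenotype": "normal"},
    {"samplename": "sample2", "description": "patient1_tumor",
     "batch": "batch1", "phenotype": "tumor"},
]


def build_metadata_csv(rows, fieldnames):
    """Return the metadata table as CSV text with a header row."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


metadata_csv = build_metadata_csv(ROWS, FIELDS)
print(metadata_csv)
```

Writing `metadata_csv` to `project1.csv` would give a file in the shape the template command above takes as its second argument.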

Documentation

See the full documentation and longer analysis-based articles. We welcome enhancements or problem reports using GitHub and discussion on the biovalidation mailing list.

Contributors

License

The code is freely available under the MIT license.

Comments
  • torque is hanging indefinitely

    Hey there -

    In trying the updates for #386 we have killed our development install with 756be0ac: any job we try to run, be it human, rat, mouse, or the broken dogs, hangs indefinitely with torque. The nodes get checked out and the engine and clients look to be running via qstat or showq; however, nothing is happening on the nodes when I look at top or ps aux. There are plenty of free nodes, so this doesn't seem to be a queue issue. The jobs all hang until they hit the timeout, and that's all I get. I don't see anything in the logs/ipython logs. Engines appear to have started successfully... I've rubbed my eyes and wiped my work dirs a few times to no avail. I checked, and indeed running -t local works. Any suggestions or additional info I can provide?

    Thanks!

    opened by caddymob 67
  • Scalpel InDel calling support

    Looks like vcf support has been added to Scalpel recently: http://sourceforge.net/p/scalpel/code/ci/master/tree/

    Opening this ticket while I'm looking into testing Scalpel and integrating it within bcbio, bear with me

    opened by mjafin 65
  • Problems with logs and joint VCF file generation in latest dev build

    Hello,

    After upgrading to the latest development version, the logs and joint VCF file generation don't seem to work properly anymore. Debug messages don't get printed anymore (neither on stdout, nor in the log file), and the bcbio-nextgen-debug.log file is pretty much identical with bcbio-nextgen.log. The only difference is the resource requests messages which appear in the debug log:

    [2018-08-02T10:10Z] Resource requests: bwa, sambamba, samtools; memory: 3.00, 3.00, 3.00; cores: 16, 16, 16
    [2018-08-02T10:10Z] Configuring 2 jobs to run, using 16 cores each with 48.1g of memory reserved for each job
    
    [2018-08-02T10:10Z] Resource requests: gatk, gatk-haplotype, picard; memory: 3.50, 3.00, 3.00; cores: 1, 16, 16
    [2018-08-02T10:10Z] Configuring 32 jobs to run, using 1 cores each with 3.50g of memory reserved for each job
    
    [2018-08-02T10:18Z] Resource requests: bcbio_variation, fastqc, gatk, gatk-vqsr, gemini, kraken, preseq, qsignature, sambamba, samtools; memory: 3.00, 3.00, 3.50, 3.00, 3.00, 3.00, 3.00, 3.00, 3.00, 3.00; cores: 16, 16, 1, 16, 16, 16, 16, 16, 16, 16
    [2018-08-02T10:18Z] Configuring 2 jobs to run, using 16 cores each with 56.1g of memory reserved for each job
    

    The multi-sample <batch>-gatk-haplotype-joint-annotated.vcf.gz did not get generated, even though the sample-specific VCF files are where they should be.

    Furthermore, bcbio-nextgen-commands.log is completely empty.

    To test all of this, I've run a simple variant calling job that worked flawlessly a few days ago, before upgrading Bcbio-nextgen.

    opened by amizeranschi 60
  • RFC / RFE: LOH analysis in tumor-normal samples

    Given the interest in studies that involve tumor heterogeneity / subclonality, bcbio currently offers "out of the box" support for both somatic variants and CNVs. A useful metric that could be combined with these (and is already used by some tools, such as CNVkit's plotting) is LOH, which (to my knowledge) is not yet handled.

    I admit I'm not sure whether there is already support for this in bcbio. I know that back in the day I baked in VarScan support to actually remove LOH calls from the VCF, as they weren't truly somatic calls.

    The biggest problem here is how to actually and reliably extract this information. MuTect[2] might have it in the REJECTed calls (but how to distinguish them?), VarScan 2 calls them (it might just be necessary to move them elsewhere), and I'm not sure how FreeBayes and VarDict handle them.

    Or are there any other tools more suited for this purpose?

    I'm willing to put the money where my mouth is in this case as we're starting to explore this in my institution and having bcbio do that would greatly streamline things.

    opened by lbeltrame 59
  • Trio pipeline

    @chapmanb

    1. I would like to run a trio analysis on whole exome samples. Can I use all callers (strelka2, deepvariant, vardict, gatk etc.) for a trio analysis with samples having the same batch name? Can I use the ensemble method?

    2. I am also trying to do CNV analysis in this trio. Can I add all svcallers? Do they all work with a single germline sample?

    It would also be nice to specify in the documentation:

    • Which callers can be used for Germline Variant Calling
    • Which callers can only be used for Somatic (Tumor-Normal) Variant Calling
    • Which callers can be used for Germline SV Calling
    • Which callers can only be used for Somatic (Tumor-Normal) SV Calling
    • Which callers can be used for Trio analysis

    opened by kokyriakidis 56
  • canfam3 dbSNP - ensembl 75

    Greetings! Can we add the canine dbSNP vcf to the variation resources in 9dcb447, please? I realize recalibration will not be available, but getting rsIDs sure would be nice :)

    The vcf can be obtained here: ftp://ftp.ensembl.org/pub/release-75/variation/vcf/canis_familiaris/Canis_familiaris.vcf.gz

    The only thing is that the canine genome for bcbio has "chr" prefixes on contigs where the dbSNP does not... I seem to recall you have an ensembl <--> ucsc conversion method from when we added the rn5 genome, so I'm hoping this is easy without just awk'in on a 'chr' :)
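
To make the contig-name mismatch concrete, here is a minimal Python sketch of a naive Ensembl-to-UCSC rename for VCF lines. It is not the conversion method bcbio uses; the mapping rule (add `chr`, map `MT` to `chrM`) is an assumption that holds only for primary chromosomes, and unplaced scaffolds generally need a lookup table instead.

```python
import re

CONTIG_HEADER = re.compile(r"^(##contig=<ID=)([^,>]+)")


def ensembl_to_ucsc(contig):
    """Naive rename: assumes primary chromosomes only (hypothetical rule)."""
    if contig.startswith("chr"):
        return contig
    if contig == "MT":
        return "chrM"
    return "chr" + contig


def rename_vcf_line(line):
    """Rename the contig in one VCF line, leaving other header lines alone."""
    if line.startswith("##contig=<ID="):
        return CONTIG_HEADER.sub(
            lambda m: m.group(1) + ensembl_to_ucsc(m.group(2)), line)
    if line.startswith("#"):
        return line
    # Data line: the first tab-separated column is CHROM.
    chrom, sep, rest = line.partition("\t")
    return ensembl_to_ucsc(chrom) + sep + rest


example = ["##fileformat=VCFv4.1\n",
           "##contig=<ID=MT,length=16727>\n",
           "#CHROM\tPOS\tID\tREF\tALT\n",
           "1\t100\trs1\tA\tG\n"]
for vcf_line in example:
    print(rename_vcf_line(vcf_line), end="")
```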

    Thanks!

    opened by caddymob 56
  • Incorrect CNVkit output

    I've used CNVkit a few times, but for this particular sample the output reports everything as a loss.

    This is the head of the T1.cns file produced by bcbio (I removed the gene column for clarity):

    chromosome | start | end | gene | log2 | baf | depth | probes | weight
    -- | -- | -- | -- | -- | -- | -- | -- | --
    chr1 | 10044 | 3783855 | removed | -1.86927 | 0.402385 | 3.47617 | 3006 | 404.754
    chr1 | 3786057 | 12808529 | removed | -2.93556 | 0.446113 | 1.56538 | 8470 | 1231.48
    chr1 | 12810479 | 14874986 | removed | -4.6436 | 0.433824 | 0.933951 | 1538 | 225.929
    chr1 | 14876201 | 16524732 | removed | -2.55335 | 0.439711 | 1.89716 | 1584 | 229.988
    chr1 | 16525961 | 16962159 | removed | -0.14769 | 0.263566 | 4.08855 | 297 | 42.2095
    chr1 | 16962525 | 46822896 | removed | -3.13869 | 0.444444 | 1.5238 | 26862 | 3916.87
    chr1 | 46824333 | 51700063 | removed | -4.49549 | 0.449153 | 0.982765 | 3712 | 547.092
    

    note that all of the log2 values are quite negative (-call.cns is similar)
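
To put the log2 values above in perspective, a log2 ratio can be converted to an approximate absolute copy number as ploidy × 2^log2. The sketch below is illustrative only: it assumes a diploid baseline and a pure sample, and the -0.25 loss cutoff is an assumed threshold, not something stated in the report.

```python
# Segments copied from the bcbio-produced table above: (chromosome, start, end, log2).
SEGMENTS = [
    ("chr1", 10044, 3783855, -1.86927),
    ("chr1", 3786057, 12808529, -2.93556),
    ("chr1", 12810479, 14874986, -4.6436),
    ("chr1", 14876201, 16524732, -2.55335),
    ("chr1", 16525961, 16962159, -0.14769),
    ("chr1", 16962525, 46822896, -3.13869),
    ("chr1", 46824333, 51700063, -4.49549),
]

LOSS_CUTOFF = -0.25  # assumed threshold for calling a segment a loss


def log2_to_copies(log2_ratio, ploidy=2):
    """Approximate copy number, assuming a pure sample and the given ploidy."""
    return ploidy * 2 ** log2_ratio


losses = [seg for seg in SEGMENTS if seg[3] < LOSS_CUTOFF]
for chrom, start, end, log2_ratio in SEGMENTS:
    copies = log2_to_copies(log2_ratio)
    print(f"{chrom}:{start}-{end} log2={log2_ratio} copies~{copies:.2f}")
print(f"{len(losses)} of {len(SEGMENTS)} segments fall below the loss cutoff")
```

Under these assumptions most of the listed segments come out at well under one copy, which is implausible across such long stretches of chr1 and is consistent with the reporter's suspicion that the bcbio-produced values are off.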

    and this is the result of running cnvkit manually:

    chromosome | start | end | gene | log2 | depth | probes | weight
    -- | -- | -- | -- | -- | -- | -- | --
    chr1 | 65409 | 7106529 | X | -0.06439 | 105.197 | 1521 | 522.478
    chr1 | 7107029 | 1.22E+08 | X |   |   |   |  
    chr1 | 1.22E+08 | 1.25E+08 | X | 1.19374 | 3.32145 | 16 | 7.01406
    chr1 | 1.44E+08 | 1.52E+08 | X |   |   |   |  
    chr1 | 1.52E+08 | 1.52E+08 | X | 0.595503 | 239.102 | 74 | 21.886
    chr1 | 1.52E+08 | 2.48E+08 | X |   |   |   |  
    chr1 | 2.48E+08 | 2.49E+08 | X | 0.304078 | 156.569 | 184 | 63.7607
    chr2 | 41359 | 93085490 | X |   |   |   |  
    chr2 | 94573375 | 1.79E+08 | X |   |   |   |  
    chr2 | 1.79E+08 | 1.79E+08 | X | 0.168907 | 216.203 | 540 | 173.011
    chr2 | 1.79E+08 | 2E+08 | X | 0.031598 | 92.5111 | 1717 | 642.207
    

    While some of the columns are 0, the results are much closer to accurate.

    this is the manual command, which i don't think is particularly unique and uses the bcbio generated bam files

    cnvkit.py batch final/T1/T1-ready.bam --normal final/N1/N1-ready.bam -p 8 --targets ../S04380110_Padded_hg38_trimmed.bed --fasta /mnt/biodata/genomes/Hsapiens/hg38/seq/hg38.fa --output-dir ./cnvkit/ --diagram --scatter

    opened by choosehappy 52
  • Using UMIs in the bcbio smallRNA pipeline

    Hi,

    This is somewhat similar to #2070. We have single-end .fastq files with the following format:

    @NB500965:105:HC5J5BGX2:1:11108:16467:3587 1:N:0:ATCACG
    TTCAAGTAATCCAGGATAGGAACTGTAGGCACCATCAATGACACCGAACGTAGATCGGAAAGCACACGTCTGAACT
    +
    AAAAAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAEEEEEEEEEEEEEEEAAEEE/EE

    where ATCACG is the unique sample index and AACTGTAGGCACCATCAAT is the 3' adapter.

    Following the 3' adapter is a 12 nt UMI. If I massage the .fastq file such that they are in the format:

    @NB500965:105:HC5J5BGX2:1:11108:16467:3587 1:N:0:ATCACG:UMI_GACACCGAACGTAGA
    TTCAAGTAATCCAGGATAGGAACTGTAGGCACCATCAATGACACCGAACGTAGATCGGAAAGCACACGTCTGAACT
    +
    AAAAAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAEEEEEEEEEEEEEEEAAEEE/EE
    

    am I then able to add umi_type: fastq_name to the bcbio .yaml config and run through the small RNA pipeline? Is there a better way of doing this?
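
The "massaging" step described above can be sketched in Python as follows. This is a hypothetical helper, not part of bcbio: it assumes the UMI is the 12 nt immediately after the first occurrence of the 3' adapter and appends it to the read name in the `:UMI_` style shown in the post. With a 12 nt UMI, the example read yields `GACACCGAACGT`, the first 12 bases of the tag shown above.

```python
ADAPTER = "AACTGTAGGCACCATCAAT"  # 3' adapter quoted in the post
UMI_LEN = 12                     # UMI length stated in the post


def tag_umi(name, seq, adapter=ADAPTER, umi_len=UMI_LEN):
    """Return the read name with ':UMI_<umi>' appended, or None if no UMI is found."""
    idx = seq.find(adapter)
    if idx == -1:
        return None  # adapter not present in this read
    start = idx + len(adapter)
    umi = seq[start:start + umi_len]
    if len(umi) < umi_len:
        return None  # read ends before a full UMI
    return f"{name}:UMI_{umi}"


read_name = "@NB500965:105:HC5J5BGX2:1:11108:16467:3587 1:N:0:ATCACG"
read_seq = ("TTCAAGTAATCCAGGATAGGAACTGTAGGCACCATCAAT"
            "GACACCGAACGTAGATCGGAAAGCACACGTCTGAACT")
print(tag_umi(read_name, read_seq))
```

The sketch leaves the sequence and quality lines untouched, as in the post's example; whether the adapter and UMI should also be trimmed from the read is a separate decision.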

    All advice gratefully received.

    opened by mxhp75 51
  • AssertionError multisample joint calling

    Hi Brad and group,

    We recently ran bcbio with multisample joint calling and it's failing. Single-sample joint calling works. It's complaining about coordinates, but I am not sure why it worked for a single sample. Attached are the bcbio error file and the sample yaml file.

    Thanks,

    bcbio.stderr.txt sample.yaml.txt

    opened by DiyaVaka 51
  • RFC: allele fraction thresholds for paired analyses

    MuTect has a threshold setting (--tumor_f_pretest) to select sites with at least a certain fraction of non-REF alleles. Something similar is in VarScan (minimum frequency to call an allele as heterozygote). MuTect has no preset; VarScan uses 0.1 by default.

    I'm wondering if (hence the RFC) this could be handled in the algorithm parameters, or at least harmonized between the two callers. Selecting a proper "frequency" (quotes, because you can't really call it frequency when you have just a sample pair) is important for validation.
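
As a concrete picture of what a harmonized threshold would do, the following Python sketch computes the non-REF allele fraction from read depths and applies a single cutoff. The function names are hypothetical and the 0.1 default simply mirrors the VarScan value mentioned above; it is not an existing bcbio parameter.

```python
def allele_fraction(ref_depth, alt_depth):
    """Fraction of reads supporting the non-REF allele (0.0 when there is no coverage)."""
    total = ref_depth + alt_depth
    return alt_depth / total if total else 0.0


def passes_min_fraction(ref_depth, alt_depth, min_fraction=0.1):
    """True when the site reaches the minimum non-REF allele fraction."""
    return allele_fraction(ref_depth, alt_depth) >= min_fraction


# Invented depths for illustration: (ref reads, alt reads).
for ref, alt in [(90, 10), (95, 5), (0, 0)]:
    print(ref, alt, allele_fraction(ref, alt), passes_min_fraction(ref, alt))
```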

    Opinions? Pro, contra?

    discussion 
    opened by lbeltrame 49
  • error in bcbio structural variant calling

    Hi Brad,

    Thanks for your help. I want to call structural variants, but get an error: the parallel, svtyper, cnvnator_wrapper.py, cnvnator-multi, annotate_rd.py are not found in PATH, like this:

    [2014-10-27 23:05] Uncaught exception occurred
    Traceback (most recent call last):
      File "/public/software/bcbio-nextgen/anaconda/lib/python2.7/site-packages/bcbio/provenance/do.py", line 20, in run
        _do_run(cmd, checks, log_stdout)
      File "/public/software/bcbio-nextgen/anaconda/lib/python2.7/site-packages/bcbio/provenance/do.py", line 93, in _do_run
        raise subprocess.CalledProcessError(exitcode, error_msg)
    CalledProcessError: Command 'set -o pipefail; speedseq sv -v -B ......
    Sourcing executables from /public/software/bcbio-nextgen/tools/bin/speedseq.config ...
    which: no parallel in (/public/software/bcbio-nextgen/tools/bin:/public/software/bcbio-nextgen/anaconda/bin:.....)
    which: no svtyper in (/public/software/bcbio-nextgen/tools/bin:/public/software/bcbio....
    which: no cnvnator_wrapper.py in (/public/software/bcbio-nextgen/tools/bin:/public/software/bcbio....
    which: no cnvnator-multi in (/public/software/bcbio-nextgen/tools/bin:/public/software/bcbio-....
    which: no annotate_rd.py in((/public/software/bcbio-nextgen/tools/bin:/....)
    Calculating alignment stats...
    sambamba-view: (Broken pipe)
    Traceback (most recent call last):
      File "/public/software/bcbio-nextgen/tools/share/lumpy-sv/pairend_distro.py", line 12, in <module>
        import numpy as np
    ImportError: No module named numpy
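
The log points at two separate gaps: executables missing from PATH and a Python interpreter without numpy. A quick way to confirm both before re-running is sketched below in Python. The tool list is copied from the `which:` lines above; the helper names are hypothetical, and the module check only covers the interpreter that runs the sketch.

```python
import importlib.util
import shutil

# Executables reported as missing in the log above.
TOOLS = ["parallel", "svtyper", "cnvnator_wrapper.py",
         "cnvnator-multi", "annotate_rd.py"]


def missing_tools(names, path=None):
    """Return the names that cannot be found on PATH (or on the given path string)."""
    return [name for name in names if shutil.which(name, path=path) is None]


def has_module(name):
    """True if the current Python interpreter can locate the module."""
    return importlib.util.find_spec(name) is not None


print("missing executables:", missing_tools(TOOLS))
print("numpy importable:", has_module("numpy"))
```

Passing the PATH printed in the log as `path` checks the same directories speedseq searched rather than the current shell's PATH.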

    How can I fix this? Thanks again.

    Shangqian

    opened by shang-qian 47
  • ValueError: Could not find directory in config for snpeff

    Version info

    • bcbio version (bcbio_nextgen.py --version): 1.2.5
    • OS name and version (lsb_release -ds): "CentOS Linux release 7.9.2009 (Core)"

    To Reproduce Exact bcbio command you have used:

    bcbio_nextgen.py ${yaml} -n 500 -t ipython -s slurm -q batch -r "t=4-00:00:00"  --timeout 4000 --retries 500 
    
    

    Your yaml configuration file:

    resources:
      bwa:
         cores: 8
         memory: 3.5G
      samtools:
         cores: 4
         memory: 3.5G
      gatk:
         jvm_opts: ['-Xms6g' , '-Xmx6g']
         memory: 16G
    
    details:
    - algorithm:
        adapters:
        - polya
        aligner: bwa
        jointcaller: gatk-haplotype-joint
        mark_duplicates: true
        realign: false
        save_diskspace: true
        trim_reads: read_through
        variantcaller: gatk-haplotype
        vcfanno: gemini
      analysis: variant2
      description: 10G
      files:
      - 10G_R1.fastq.gz
      - 10G_R2.fastq.gz
      genome_build: hg38
      metadata:
        batch: ksu
        sex: male
    - algorithm:
        adapters:
        - polya
        aligner: bwa
        jointcaller: gatk-haplotype-joint
        mark_duplicates: true
        realign: false
        save_diskspace: true
        trim_reads: read_through
        variantcaller: gatk-haplotype
        vcfanno: gemini
      analysis: variant2
      description: 10F
      files:
      - 10F_R1.fastq.gz
      - 10F_R2.fastq.gz
      genome_build: hg38
      metadata:
        batch: ksu
        sex: male 
    fc_name: '28'
    upload:
      dir: ../final
    

    Log files (could be found in work/log) Please attach (10MB max): bcbio-nextgen-commands.log, and bcbio-nextgen-debug.log. It works perfectly, but at the final annotation I get the following error:

    2023-01-03 08:01:35.623 [IPClusterStart] Loaded config file: /encrypted/e3008/Azza/ksu_bcbio_fam/families/28/work/log/ipython/ipcluster_config.py
    2023-01-03 08:01:35.623 [IPClusterStart] Looking for ipcluster_config in /encrypted/e3008/Azza/ksu_bcbio_fam/families/28/work
    [2023-01-03T05:02Z] cn605-27-r: Timing: variant post-processing
    [2023-01-03T05:02Z] cn605-27-r: ipython: postprocess_variants
    [2023-01-03T05:02Z] cn514-09-l: Finalizing variant calls: 10G, gatk-haplotype
    [2023-01-03T05:02Z] cn514-09-l: Calculating variation effects for 10G, gatk-haplotype
    [2023-01-03T05:02Z] cn514-09-l: Unexpected error
    Traceback (most recent call last):
      File "/ibex/sw/csi/bcbio-nextgen/1.2.5/el7.9_python2/bcbio/anaconda/lib/python3.6/site-packages/bcbio/distributed/ipythontasks.py", line 54, in _setup_logging
        yield config
      File "/ibex/sw/csi/bcbio-nextgen/1.2.5/el7.9_python2/bcbio/anaconda/lib/python3.6/site-packages/bcbio/distributed/ipythontasks.py", line 360, in postprocess_variants
        return ipython.zip_args(apply(variation.postprocess_variants, *args))
      File "/ibex/sw/csi/bcbio-nextgen/1.2.5/el7.9_python2/bcbio/anaconda/lib/python3.6/site-packages/bcbio/distributed/ipythontasks.py", line 82, in apply
        return object(*args, **kwargs)
      File "/ibex/sw/csi/bcbio-nextgen/1.2.5/el7.9_python2/bcbio/anaconda/lib/python3.6/site-packages/bcbio/pipeline/variation.py", line 97, in postprocess_variants
        ann_vrn_file, vrn_stats = effects.add_to_vcf(data[vrn_key], data)
      File "/ibex/sw/csi/bcbio-nextgen/1.2.5/el7.9_python2/bcbio/anaconda/lib/python3.6/site-packages/bcbio/variation/effects.py", line 32, in add_to_vcf
        ann_vrn_file, stats_files = snpeff_effects(in_file, data)
      File "/ibex/sw/csi/bcbio-nextgen/1.2.5/el7.9_python2/bcbio/anaconda/lib/python3.6/site-packages/bcbio/variation/effects.py", line 298, in snpeff_effects
        return _run_snpeff(vcf_in, "vcf", data)
      File "/ibex/sw/csi/bcbio-nextgen/1.2.5/el7.9_python2/bcbio/anaconda/lib/python3.6/site-packages/bcbio/variation/effects.py", line 399, in _run_snpeff
        snpeff_db, datadir = get_db(data)
      File "/ibex/sw/csi/bcbio-nextgen/1.2.5/el7.9_python2/bcbio/anaconda/lib/python3.6/site-packages/bcbio/variation/effects.py", line 353, in get_db
        snpeff_base_dir, snpeff_db = _installed_snpeff_genome(snpeff_db, data["config"])
      File "/ibex/sw/csi/bcbio-nextgen/1.2.5/el7.9_python2/bcbio/anaconda/lib/python3.6/site-packages/bcbio/variation/effects.py", line 439, in _installed_snpeff_genome
        snpeff_config_file = os.path.join(config_utils.get_program("snpeff", config, "dir"),
      File "/ibex/sw/csi/bcbio-nextgen/1.2.5/el7.9_python2/bcbio/anaconda/lib/python3.6/site-packages/bcbio/pipeline/config_utils.py", line 193, in get_program
        return _get_program_dir(name, pconfig)
      File "/ibex/sw/csi/bcbio-nextgen/1.2.5/el7.9_python2/bcbio/anaconda/lib/python3.6/site-packages/bcbio/pipeline/config_utils.py", line 249, in _get_program_dir
        raise ValueError("Could not find directory in config for %s" % name)
    ValueError: Could not find directory in config for snpeff
    
    

    Although the snpeff database is there in the bcbio directory

    ls ./bcbio/genomes/Hsapiens/hg38/
    bwa  config  coverage  editing  rnaseq  rtg  seq  snpeff  srnaseq  star  txtmp  validation  variation  vep  versions.csv  viral
    
    

    Thank you!

    opened by azzatha 0
  • recalibrate=true fails, Unsupported class file major version 55

    Version info

    • bcbio version: 1.2.9
    • OS name and version: Ubuntu 18.04.5 LTS

    To Reproduce Exact bcbio command you have used:

    bcbio_nextgen.py ../config/config.yaml -n 8
    

    Your yaml configuration file:

    details:
    - algorithm:
        aligner: bwa
        exclude_regions: [lcr]
        mark_duplicates: true
        recalibrate: true
        variantcaller: [mutect2, strelka2, varscan, vardict]
        variant_regions: /media/gpudrive/apps/bcbio/genomes/Hsapiens/GRCh37/coverage/capture_regions/Exome-NGv3.bed
      analysis: variant2
      description: Patient70-normal
      files:
        - normal_1.fq.gz
        - normal_2.fq.gz
      genome_build: GRCh37
      metadata:
        batch: Patient70
        phenotype: normal
    - algorithm:
        aligner: bwa
        mark_duplicates: true
        recalibrate: true
        remove_lcr: true
        variantcaller: [mutect2, strelka2, varscan, vardict]
        variant_regions: /media/gpudrive/apps/bcbio/genomes/Hsapiens/GRCh37/coverage/capture_regions/Exome-NGv3.bed
      analysis: variant2
      description: Patient70-tumor
      files:
        - tumor_1.fq.gz
        - tumor_2.fq.gz
      genome_build: GRCh37
      metadata:
        batch: Patient70
        phenotype: tumor
    upload:
        dir: ../final
    

    Log files (could be found in work/log) Here are the important parts of the log I guess

    [2022-12-25T18:24Z] GATK: BaseRecalibratorSpark
    [2022-12-25T18:25Z] 18:25:59.390 INFO  BaseRecalibratorSpark - ------------------------------------------------------------
    [2022-12-25T18:25Z] 18:25:59.391 INFO  BaseRecalibratorSpark - The Genome Analysis Toolkit (GATK) v4.2.6.1
    [2022-12-25T18:25Z] 18:25:59.391 INFO  BaseRecalibratorSpark - For support and documentation go to https://software.broadinstitute.org/gatk/
    [2022-12-25T18:25Z] 18:25:59.391 INFO  BaseRecalibratorSpark - Executing as [email protected] on Linux v4.15.0-197-generic amd64
    [2022-12-25T18:25Z] 18:25:59.391 INFO  BaseRecalibratorSpark - Java runtime: OpenJDK 64-Bit Server VM v11.0.9.1-internal+0-adhoc..src
    [2022-12-25T18:25Z] 18:25:59.392 INFO  BaseRecalibratorSpark - Start Date/Time: December 25, 2022 at 6:25:03 PM UTC
    [2022-12-25T18:25Z] 18:25:59.392 INFO  BaseRecalibratorSpark - ------------------------------------------------------------
    [2022-12-25T18:25Z] 18:25:59.392 INFO  BaseRecalibratorSpark - ------------------------------------------------------------
    [2022-12-25T18:25Z] 18:25:59.393 INFO  BaseRecalibratorSpark - HTSJDK Version: 2.24.1
    [2022-12-25T18:25Z] 18:25:59.393 INFO  BaseRecalibratorSpark - Picard Version: 2.27.1
    [2022-12-25T18:25Z] 18:25:59.393 INFO  BaseRecalibratorSpark - Built for Spark Version: 2.4.5
    
    ...
    [2022-12-25T18:36Z] java.lang.IllegalArgumentException: Unsupported class file major version 55
    [2022-12-25T18:36Z]     at org.apache.xbean.asm6.ClassReader.<init>(ClassReader.java:166)
    [2022-12-25T18:36Z]     at org.apache.xbean.asm6.ClassReader.<init>(ClassReader.java:148)
    [2022-12-25T18:36Z]     at org.apache.xbean.asm6.ClassReader.<init>(ClassReader.java:136)
    [2022-12-25T18:36Z]     at org.apache.xbean.asm6.ClassReader.<init>(ClassReader.java:237)
    [2022-12-25T18:36Z]     at org.apache.spark.util.ClosureCleaner$.getClassReader(ClosureCleaner.scala:49)
    [2022-12-25T18:36Z]     at org.apache.spark.util.FieldAccessFinder$$anon$3$$anonfun$visitMethodInsn$2.apply(ClosureCleaner.scala:517)
    [2022-12-25T18:36Z]     at org.apache.spark.util.FieldAccessFinder$$anon$3$$anonfun$visitMethodInsn$2.apply(ClosureCleaner.scala:500)
    [2022-12-25T18:36Z]     at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:733)
    [2022-12-25T18:36Z]     at scala.collection.mutable.HashMap$$anon$1$$anonfun$foreach$2.apply(HashMap.scala:134)
    [2022-12-25T18:36Z]     at scala.collection.mutable.HashMap$$anon$1$$anonfun$foreach$2.apply(HashMap.scala:134)
    [2022-12-25T18:36Z]     at scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:236)
    [2022-12-25T18:36Z]     at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:40)
    [2022-12-25T18:36Z]     at scala.collection.mutable.HashMap$$anon$1.foreach(HashMap.scala:134)
    [2022-12-25T18:36Z]     at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:732)
    [2022-12-25T18:36Z]     at org.apache.spark.util.FieldAccessFinder$$anon$3.visitMethodInsn(ClosureCleaner.scala:500)
    [2022-12-25T18:36Z]     at org.apache.xbean.asm6.ClassReader.readCode(ClassReader.java:2175)
    [2022-12-25T18:36Z]     at org.apache.xbean.asm6.ClassReader.readMethod(ClassReader.java:1238)
    [2022-12-25T18:36Z]     at org.apache.xbean.asm6.ClassReader.accept(ClassReader.java:631)
    [2022-12-25T18:36Z]     at org.apache.xbean.asm6.ClassReader.accept(ClassReader.java:355)
    [2022-12-25T18:36Z]     at org.apache.spark.util.ClosureCleaner$$anonfun$org$apache$spark$util$ClosureCleaner$$clean$14.apply(ClosureCleaner.scala:307)
    [2022-12-25T18:36Z]     at org.apache.spark.util.ClosureCleaner$$anonfun$org$apache$spark$util$ClosureCleaner$$clean$14.apply(ClosureCleaner.scala:306)
    [2022-12-25T18:36Z]     at scala.collection.immutable.List.foreach(List.scala:392)
    [2022-12-25T18:36Z]     at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:306)
    [2022-12-25T18:36Z]     at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:162)
    [2022-12-25T18:36Z]     at org.apache.spark.SparkContext.clean(SparkContext.scala:2326)
    [2022-12-25T18:36Z]     at org.apache.spark.SparkContext.runJob(SparkContext.scala:2100)
    [2022-12-25T18:36Z]     at org.apache.spark.SparkContext.runJob(SparkContext.scala:2126)
    [2022-12-25T18:36Z]     at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:990)
    [2022-12-25T18:36Z]     at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    [2022-12-25T18:36Z]     at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
    [2022-12-25T18:36Z]     at org.apache.spark.rdd.RDD.withScope(RDD.scala:385)
    [2022-12-25T18:36Z]     at org.apache.spark.rdd.RDD.collect(RDD.scala:989)
    [2022-12-25T18:36Z]     at org.apache.spark.RangePartitioner$.sketch(Partitioner.scala:309)
    [2022-12-25T18:36Z]     at org.apache.spark.RangePartitioner.<init>(Partitioner.scala:171)
    [2022-12-25T18:36Z]     at org.apache.spark.RangePartitioner.<init>(Partitioner.scala:151)
    [2022-12-25T18:36Z]     at org.apache.spark.rdd.OrderedRDDFunctions$$anonfun$sortByKey$1.apply(OrderedRDDFunctions.scala:62)
    [2022-12-25T18:36Z]     at org.apache.spark.rdd.OrderedRDDFunctions$$anonfun$sortByKey$1.apply(OrderedRDDFunctions.scala:61)
    [2022-12-25T18:36Z]     at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    [2022-12-25T18:36Z]     at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
    [2022-12-25T18:36Z]     at org.apache.spark.rdd.RDD.withScope(RDD.scala:385)
    [2022-12-25T18:36Z]     at org.apache.spark.rdd.OrderedRDDFunctions.sortByKey(OrderedRDDFunctions.scala:61)
    [2022-12-25T18:36Z]     at org.apache.spark.api.java.JavaPairRDD.sortByKey(JavaPairRDD.scala:936)
    [2022-12-25T18:36Z]     at org.broadinstitute.hellbender.utils.spark.SparkUtils.sortUsingElementsAsKeys(SparkUtils.java:165)
    [2022-12-25T18:36Z]     at org.broadinstitute.hellbender.engine.spark.datasources.ReadsSparkSink.sortSamRecordsToMatchHeader(ReadsSparkSink.java:207)
    [2022-12-25T18:36Z]     at org.broadinstitute.hellbender.engine.spark.datasources.ReadsSparkSink.writeReads(ReadsSparkSink.java:107)
    [2022-12-25T18:36Z]     at org.broadinstitute.hellbender.engine.spark.GATKSparkTool.writeReads(GATKSparkTool.java:374)
    [2022-12-25T18:36Z]     at org.broadinstitute.hellbender.engine.spark.GATKSparkTool.writeReads(GATKSparkTool.java:362)
    [2022-12-25T18:36Z]     at org.broadinstitute.hellbender.tools.spark.ApplyBQSRSpark.runTool(ApplyBQSRSpark.java:90)
    [2022-12-25T18:36Z]     at org.broadinstitute.hellbender.engine.spark.GATKSparkTool.runPipeline(GATKSparkTool.java:546)
    [2022-12-25T18:36Z]     at org.broadinstitute.hellbender.engine.spark.SparkCommandLineProgram.doWork(SparkCommandLineProgram.java:31)
    [2022-12-25T18:36Z]     at org.broadinstitute.hellbender.cmdline.CommandLineProgram.runTool(CommandLineProgram.java:140)
    [2022-12-25T18:36Z]     at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMainPostParseArgs(CommandLineProgram.java:192)
    [2022-12-25T18:36Z]     at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:211)
    [2022-12-25T18:36Z]     at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:160)
    [2022-12-25T18:36Z]     at org.broadinstitute.hellbender.Main.mainEntry(Main.java:203)
    [2022-12-25T18:36Z]     at org.broadinstitute.hellbender.Main.main(Main.java:289)
    [2022-12-25T18:36Z] 22/12/25 18:36:31 INFO ShutdownHookManager: Shutdown hook called
    Using GATK jar /pathto/bcbio/anaconda/envs/java/share/gatk4-4.2.6.1-1/gatk-package-4.2.6.1-local.jar
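
The exception message can be decoded directly: a class file's major version maps to a Java release as major minus 44, so major version 55 corresponds to Java 11, which matches the OpenJDK 11 runtime reported in the log. A plausible reading, offered as an assumption rather than a confirmed diagnosis, is that the bytecode reader bundled with this Spark 2.4.5 build does not understand Java 11 class files. A small Python sketch of the mapping:

```python
import re

VERSION_PATTERN = re.compile(r"Unsupported class file major version (\d+)")


def java_release(major_version):
    """Map a class-file major version to its Java release (52 -> 8, 55 -> 11)."""
    if major_version < 45:
        raise ValueError("class-file major versions start at 45")
    return major_version - 44


def release_from_error(message):
    """Extract the Java release implied by the error message, or None."""
    match = VERSION_PATTERN.search(message)
    return java_release(int(match.group(1))) if match else None


error_line = ("java.lang.IllegalArgumentException: "
              "Unsupported class file major version 55")
print(release_from_error(error_line))  # prints 11
```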
    
    opened by asalimih 0
  • Add --cloudbiolinux argument

    Fixes the following issue: https://github.com/bcbio/bcbio-nextgen/issues/3689

    The problem originated from this commit: https://github.com/bcbio/bcbio-nextgen/commit/d61e77825f46548101db9b64776269f8e96ee220

    opened by amizeranschi 0
  • [main_samview] fail to read the header from

    [main_samview] fail to read the header from "filename.sam".

    Hello, I am getting the following error when trying to run samtools on a SAM file:

    [main_samview] fail to read the header from "20201032.sam".
    srun: error: node2-092: task 0: Exited with exit code 1

    But when I checked the SAM file (using head), it does contain the headers, so what can be happening?

    @SQ SN:1 LN:278617202
    @SQ SN:2 LN:250202058
    @SQ SN:3 LN:226089100
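
One thing worth ruling out, as a hypothesis only, is that the header lines look correct on screen but are not tab-delimited. The SAM format requires header fields to be separated by TABs, and `@SQ` lines must carry `SN` and `LN` tags. The Python sketch below (with a hypothetical helper name) checks a single header line for this.

```python
def sq_line_problems(line):
    """Return a list of problems found in one @SQ header line (empty if none)."""
    problems = []
    fields = line.rstrip("\n").split("\t")
    if fields[0] != "@SQ":
        problems.append("record type is not '@SQ' followed by a TAB")
    # Collect two-letter tags from well-formed TAG:VALUE fields.
    tags = {f[:2] for f in fields[1:] if len(f) > 3 and f[2] == ":"}
    for required in ("SN", "LN"):
        if required not in tags:
            problems.append(f"missing {required} tag")
    return problems


print(sq_line_problems("@SQ\tSN:1\tLN:278617202\n"))  # tab-delimited line
print(sq_line_problems("@SQ SN:1 LN:278617202\n"))    # space-delimited line
```

Running the check over the first few lines of the real file would show whether the separators are tabs or spaces.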

    My script is as follows:

    #!/bin/bash

    #SBATCH --job-name=samtools
    #SBATCH --time=72:00:00
    #SBATCH --partition=serial
    #SBATCH --ntasks=1
    #SBATCH --mem-per-cpu=100GB
    #SBATCH [email protected]
    #SBATCH --mail-type=fail,end
    #SBATCH --error=%u.%J.err
    #SBATCH --output=%u.%J.out

    # load all modules needed for the current run

    module purge # clean the current env
    module add slurm # we always need this one

    # Activate the environment

    module add TOOLS python/miniconda-3.9
    module add bio/samtools/1.16.1/gcc/9.2.0
    source activate ngs-tools

    echo "Starting at `date`"
    echo "Running on hosts: $SLURM_NODELIST"
    echo "Current working directory is `pwd`"

    srun samtools view -bh 20201032.sam > SRR519926.bam
    samtools sort 20201032.bam > SRR519926.sorted.bam
    samtools index 20201032.sorted.bam

    # Save results and final clean up

    source deactivate

    echo "Finished at `date`"

    opened by gabyrudd22 0
  • Error with bcbio_setup_genome.py: AttributeError: 'Namespace' object has no attribute 'cloudbiolinux'

    Hi,

    I'm getting an error when trying to create a custom genome. Here's the command I'm running and the error it produces:

    $ bcbio_setup_genome.py -f GWHBDNW00000000.genome.fasta -g GWHBDNW00000000.gff --gff3 -i bwa seq -n GWHBDNW00000000 -b build1 --buildversion None
    Traceback (most recent call last):
      File "/data/share/bcbio_nextgen/anaconda/bin/bcbio_setup_genome.py", line 249, in <module>
        cbl = get_cloudbiolinux(args, REMOTES)
      File "/data/share/bcbio_nextgen/anaconda/lib/python3.7/site-packages/bcbio/install.py", line 807, in get_cloudbiolinux
        cloudbiolinux_remote = remotes["cloudbiolinux"] % args.cloudbiolinux
    AttributeError: 'Namespace' object has no attribute 'cloudbiolinux'
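
The failure mode in this traceback is generic to argparse: reading an attribute that no `add_argument` call defined raises `AttributeError` on the `Namespace`. The Python sketch below reproduces it in isolation and shows how defining the option with a default avoids it, which is the approach suggested by the title of the pull request "Add --cloudbiolinux argument" listed above. The parser, the URL template and the `master` default are placeholders, not bcbio's actual code.

```python
import argparse

# Placeholder template; the real REMOTES mapping lives in bcbio's install module.
REMOTES = {"cloudbiolinux": "https://example.org/cloudbiolinux/%s.tar.gz"}


def build_parser(with_cloudbiolinux):
    """Build a toy parser, optionally defining the --cloudbiolinux option."""
    parser = argparse.ArgumentParser()
    parser.add_argument("-n", "--name")
    if with_cloudbiolinux:
        # Assumed default value, for illustration only.
        parser.add_argument("--cloudbiolinux", default="master")
    return parser


def get_remote(args):
    """Fill the URL template from the parsed arguments."""
    return REMOTES["cloudbiolinux"] % args.cloudbiolinux


try:
    get_remote(build_parser(False).parse_args([]))
except AttributeError as exc:
    print("without the option:", exc)

print("with the option:", get_remote(build_parser(True).parse_args([])))
```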
    
    
    opened by amizeranschi 0
  • Bringing back Docker support, possibly as a replacement for the various Conda environments

    Inspired by a recent comment from @gabeng, I wanted to ask if it would be a great deal of effort to bring back Docker support and the creation of new Bcbio Docker images.

    One alternative to reviving bcbio-nextgen-vm (although perhaps more laborious) could be to have the possibility to replace Conda environments with several Docker containers in bcbio-nextgen itself, as they do for example in nf-core/sarek. Given how often Conda has been breaking bcbio installs during the last couple of years, it could be worth the effort to replace it, or at least offer the possibility of using Docker containers as an alternative. And this could also pave the way for Kubernetes support at some point.

    Here's a list of the Docker images currently on my system, after a few variant calling experiments with the above pipeline:

    $ docker image ls
    REPOSITORY                                                                 TAG                                          IMAGE ID       CREATED         SIZE
    nfcore/snpeff                                                              5.1.R64-1-1                                  0462080aa43c   2 weeks ago     1.4GB
    nfcore/vep                                                                 106.1.R64-1-1                                e5c98f96ae89   2 weeks ago     1.22GB
    quay.io/biocontainers/mulled-v2-d9e7bad0f7fbc8f4458d5c3ab7ffaaf0235b59fb   551156018e5580fb94d44632dfafbc9c27005a0e-0   5703dbdd3100   2 weeks ago     1.01GB
    quay.io/biocontainers/mulled-v2-780d630a9bb6a0ff2e7b6f730906fd703e40e98f   3bdd798e4b9aed6d3e1aaa1596c913a3eeb865cb-0   c4f4a546ff1b   3 weeks ago     1.26GB
    quay.io/biocontainers/mulled-v2-fe8faa35dbf6dc65a0f7f5d4ea12e31a79f73e40   219b6c272b25e7e642ae3ff0bf0c5c81a5135ab4-0   a3d569a08aa5   3 weeks ago     133MB
    quay.io/biocontainers/gatk4                                                4.3.0.0--py36hdfd78af_0                      0f8cc7afc8e6   7 weeks ago     966MB
    quay.io/biocontainers/bcftools                                             1.16--hfe4b78e_1                             7ec55dde74af   8 weeks ago     198MB
    quay.io/biocontainers/samtools                                             1.16.1--h6899075_1                           09cd4486af55   8 weeks ago     62MB
    quay.io/biocontainers/freebayes                                            1.3.6--hbfe0e7f_2                            9c664cb1521f   2 months ago    326MB
    quay.io/biocontainers/tiddit                                               3.3.2--py310hc2b7f4b_0                       e9c7cf6b37d7   2 months ago    350MB
    quay.io/biocontainers/multiqc                                              1.13--pyhdfd78af_0                           747595fd0a8e   2 months ago    431MB
    google/deepvariant                                                         1.4.0                                        decb60cd33cb   6 months ago    5.72GB
    quay.io/biocontainers/sra-tools                                            2.11.0--pl5321ha49a11a_3                     58aa27074b50   9 months ago    379MB
    quay.io/biocontainers/mosdepth                                             0.3.3--hdfd78af_1                            14b81386a558   10 months ago   22.5MB
    quay.io/biocontainers/fastp                                                0.23.2--h79da9fb_0                           371123966d85   12 months ago   52MB
    quay.io/biocontainers/mulled-v2-5f89fe0cd045cb1d615630b9261a1d17943a9b6a   6a9ff0e76ec016c3d0d27e0c0d362339f2d787e6-0   8bb307eced25   14 months ago   387MB
    quay.io/biocontainers/python                                               3.9--1                                       34c2b9e3810c   17 months ago   191MB
    quay.io/biocontainers/cnvkit                                               0.9.9--pyhdfd78af_0                          65c84d95fbda   18 months ago   1.12GB
    quay.io/biocontainers/tabix                                                1.11--hdfd78af_0                             171149a492ea   19 months ago   94.3MB
    quay.io/biocontainers/manta                                                1.6.0--h9ee0642_1                            0be19048fb6e   20 months ago   200MB
    quay.io/bcbio/bcbio-vc                                                     latest                                       196407441ba3   23 months ago   5.89GB
    quay.io/biocontainers/gawk                                                 5.1.0                                        1f25a9f620a3   2 years ago     38.6MB
    quay.io/biocontainers/vcftools                                             0.1.16--he513fc3_4                           edbf7b8881c0   2 years ago     48MB
    quay.io/biocontainers/fastqc                                               0.11.9--0                                    9d444341a7b2   2 years ago     531MB
    quay.io/biocontainers/bwa                                                  0.7.17--hed695b0_7                           5c6028c4ea33   2 years ago     109MB
    
    opened by amizeranschi 0
Releases(v1.2.9)
  • v1.2.9(Dec 15, 2021)

    • Fix vcf header bug: T/N SAMPLE lines are back - needed for import to SolveBio
    • add strandedness: auto for -l A option in salmon
    • report 10x more peaks in ChIP/ATAC-seq - use 0.05 qvalue
    • fix misleading RNA-seq duplicated reads statistics: thanks @sib-bcf
    • reorganize conda environments
    • snpEff 5.0
    • strandedness: auto
    • document WGBS pipeline steps
    • make --local an option, not default in bismark alignment - too slow
    • bcbioRNASeq update to 0.3.44
    • pureCN update to 2.0.1
    • octopus update to 0.7.4
  • v1.2.8(Apr 14, 2021)

    • Set ENCODE library complexity flags properly for ChIP-seq. Thanks to @mistrm82.
    • Fix greylisted peaks not being propagated to the output directory. Thanks to @mistrm82.
    • Better error message when no sample barcodes are found for single-cell RNA-seq.
    • Better trimming for 2 wgbs kits
    • enable setting parameters for deduplicate_bismark
    • custom threading for bismark via yaml
    • reproducible WGBS user story with the data from Encode
    • During consensus peak calling, keep the highest-scoring peak instead of calling the summit for the highest-scoring peak and expanding the peak to 250 bases.
    • Enable consensus peak calling for broad peaks. Thanks to @mistrm82 and @yoonsquared for pointing out this was missing.
    • Re-enable ATAC-seq tests, they work now.
    • svprioritize for mm10
    • purecn_Dx.R - mutational signatures - still requires a manual update of deconstructsigs or release of it
    • make sure purecn uses sv_regions bed to call variants
    • fix misleading disambiguation fastqc read statistics (total, hg38, mm10)
    • wgbs: nebemseq kit: add --maxins 1000 and --local to bismark align
    • WGBS: sorted indexed deduplicated bam for ready.bam
    • print error message when aligner: false and hla typing is on
    • make sure that mark_duplicates is false with collapsed UMI input
  • v1.2.7(Feb 23, 2021)

    • RNASeq: Add gene body coverage plots to multiqc report.
    • Restore ability to opt out of contamination checking via tools_off.
    • Properly invoke threading for verifybamid2.
    • Fix circular import issue when using bcbio functions outside of the main bcbio script.
    • Enable setting custom PureCN options via YAML file.
  • v1.2.6(Feb 5, 2021)

    • RNASeq: Fail more gracefully if SummarizedExperiment object cannot be created.
    • Fixes to handle DRAGEN BAM files from the first stage of UMI processing.
    • Fix issue with double-annotating with dbSNP. Separating out somatic variant annotation into its own vcfanno configuration.
  • v1.2.5(Jan 9, 2021)

    1.2.5 (01 January 2021)

    • Joint calling for RNA-seq variant calling requires setting jointcaller to bring it in line with the configuration options for variant calling.
    • Allow pre-aligned BAMs and gVCFs for RNA-seq joint variant calling. Thanks to @WimSpree for the feature.
    • Allow CollectSequencingArtifacts to be turned off via tools_off: [collectsequencingartifacts].
    • Fix getiterator -> iter deprecation in ElementTree. Thanks to @smoe.
    • Add SummarizedExperiment object from RNA-seq runs, a simplified version of the bcbioRNASeq object.
    • Add umi_type: dragen. This enables bcbio to run with first-pass, pre-consensus called UMI BAM files from DRAGEN.
    • Turn off inferential replicate loading when creating the gene x sample RNA-seq count matrix. This allows loading of thousands of RNA-seq samples.
    • Only make isoform to gene file from express if we have run express.
    • Allow "no consensus peaks found" as a valid endpoint of a ChIP-seq analysis.
    • Allow BCBIO_TEST_DIR environment variable to control where tests end up.
    • Collect OxoG and other sequencing artifacts due to damage.
    • Round tximport estimated counts.
    • Turn off consensus peak calling for broad peaks. Thanks to @lbeltrame and @LMannarino for diagnosing the broad-peaks-run-forever bug.
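
The `getiterator -> iter` change listed above can be illustrated with the standard library alone. This is a small sketch with made-up XML and a hypothetical `sample_names` helper, not the code bcbio changed: `Element.getiterator()` was deprecated and then removed in Python 3.9, and `Element.iter()` is the replacement.

```python
import xml.etree.ElementTree as ET

# Made-up document, used only to demonstrate the traversal call.
XML = "<run><sample name='s1'><file>a.bam</file></sample><sample name='s2'/></run>"


def sample_names(xml_text):
    """Return the name attribute of every <sample> element, in document order."""
    root = ET.fromstring(xml_text)
    # Old spelling: root.getiterator("sample"), which no longer exists in 3.9+.
    return [element.get("name") for element in root.iter("sample")]


names = sample_names(XML)
```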
  • v1.2.4(Sep 21, 2020)

    1.2.4 (21 September 2020)

    • Remove deprecated --genomicsdb-use-vcf-codec option as this is now the default.
    • Add bismark output to MultiQC.
    • Fix PS genotype field from octopus to have the correct type.
    • Edit VarDict headers to report VCFv4.2, since htsjdk does not fully support VCFv4.3 yet.
    • Attempt to speed up bismark by implementing the parallelization strategy suggested here: https://github.com/FelixKrueger/Bismark/issues/96
    • Add --enumerate option to OptiType to report the top 10 calls and scores, to make it easier to decide how confident we are in a HLA call.
    • Performance improvements when HLA calling during panel sequencing. This skips running bwa-kit during the initial mapping for consensus UMI detection, greatly speeding up panel sequencing runs.
    • Allow custom options to be passed to featureCounts.
    • Fix race condition when running tests.
    • Add TOPMed as a datatarget.
    • Add predicted transcript and peptide output to arriba.
    • Add mm10 as a supported genome for arriba.
    • Skip bcbioRNASeq for more than 100 samples.
    • Add rRNA_pseudogene as a rRNA biotype.
    • Add --genomicsdb-use-vcf-codec when running GenotypeGVCF. See https://gatk.broadinstitute.org/hc/en-us/articles/360040509751-GenotypeGVCFs#--genomicsdb-use-vcf-codec for a discussion. Thanks to @amizeranschi for finding the issue and posting the solution.
    • update VEP to v100
    • Add consensus peak calling using https://bedops.readthedocs.io/en/latest/content/usage-examples/master-list.html to collapse overlapping peaks.
    • Pre-filter consensus peaks by removing peaks with FDR > 0.05 before performing consensus peak calling.
    • Add support for Qiagen's Qiaseq UPX 3' transcriptome kit for DGE. Support for 96 and 384 well configurations by specifying umi_type: qiagen-upx-96 or umi_type: qiagen-upx-384.
    • Add consensus peak counting using featureCounts.
    • Skip using autosomal-reference when calling ataqv for mouse/human, as this has a problem with ataqv (see https://github.com/ParkerLab/ataqv/issues/10 for discussion and followup).
    • Add pre-generated ataqv HTML report to upload directory.
    • Support single-end reads for ATAC-seq.
    • Move featureCount output files to featureCounts directory in project directory.
    • Remove RNA and reads in peak stats from MultiQC table when they are not calculated for a pipeline.
    • Only show somatic variant counts in the general stats table, if germline variants are calculated.
    • Add kit parameter for setting options for pipelines via just listing the kit. Currently only implemented for WGBS.
  • v1.2.3(Apr 7, 2020)

  • v1.2.2(Apr 5, 2020)

    • Fix for not properly looking up R environment variables in the base environment.
    • Remove --use-new-qual-calculator which was eliminated in GATK 4.1.5.0.
    • Ensure header is not written for a Series. In pandas 0.24.0 the default for header was changed from False to True so we have to set it explicitly now.
    • Remove unused Dockerfile. Thanks to @matthdsm.
    • ATAC-seq: Skip peak-calling on fractions with < 1000 reads.
  • v1.2.1(Mar 25, 2020)

    • Update ChIP and ATAC bowtie2 runs to use --very-sensitive.
    • Properly pad TSS BED file for ataqv TSS enrichment metrics.
    • Skip bcbioRNASeq if there are fewer than three samples.
    • Run joint-calling with single cores to save resources.
    • Re-support PureCN.
    • Skip segments with no informative SNPs when creating the LOH VCF file from PureCN output.
    • Fix for duplicated output for mosdepth in quality control report.
    • Fix for missing rRNA statistics.
  • v1.2.0(Feb 7, 2020)

    • Fix for bismark not being a supported aligner.
    • Run ataqv (https://github.com/ParkerLab/ataqv) to calculate additional ATAC-seq quality control metrics.
    • Workaround for some bcbioRNASeq plots failing with many samples when interesting_groups is not set.
    • Add known_fusions parameter for passing in known fusions to arriba.
    • Fix for tx2gene not working properly on some GTF files.
    • Sort MACS2 output with UNIX sort to avoid memory issues.
    • Run RiP on full peak file for ATAC-seq.
    • Run ataqv on unfiltered BAM file with the full peak file.
    • Run peddy on the population variant file, not the individual sample level file if joint calling was done.
    • Add STAR to MultiQC metrics.
    • Throw an error if STAR is run on a genome with alts.
    • Don't run bcbioRNASeq if there is only one sample. Thanks to @kmendler for the suggestion.
    • Improve arriba sensitivity by setting --peOverlapNbasesMin 10 and --alignSplicedMateMapLminOverLmate 0.5 when running STAR (see https://github.com/suhrig/arriba/issues/41).
    • Make TPM and counts files from tximport automatically.
    • Use --keepDuplicates when making the Salmon index. This keeps transcripts that are identical in the index instead of randomly choosing one. This helps when comparing to other ways of quantifying the transcripts, ensuring all of the transcripts are represented.
    • Remove unnecessary "quant" subdirectory for Salmon runs. This allows MultiQC to properly name the samples.
    • Ensure STAR log file is propagated to the upload directory.
    • Fix issue with memory not being specified properly when running bcbio_prepare_samples.py.
    • Run tximport automatically and store TPM in project/date/tpm and counts in project/date/counts.
    • Calculate ENCODE quality flags for ATAC-seq. See https://www.encodeproject.org/data-standards/terms/#library for a description of what the metrics mean.
    • Fix for command line being too long while joint genotyping thousands of samples.
    • Fix for command line being too long when running the CWL workflow with cromwell.
  • v1.1.9(Dec 6, 2019)

    • Fix for getting the VEP cache.
    • Support Picard's new syntax for ReorderSam (REFERENCE -> SEQUENCE_DICTIONARY).
    • Remove mitochondrial reads from ChIP/ATAC-seq calling.
    • Add documentation describing ATAC-seq outputs.
    • Add ENCODE library complexity metrics for ATAC/ChIP-seq to MultiQC report (see https://www.encodeproject.org/data-standards/terms/#library for a description of the metrics)
    • Add STAR sample-specific 2-pass. This helps assign a moderate number of reads per gene. Thanks to @naumenko-sa for the initial implementation and push to get this going.
    • Index transcriptomes only once for pseudo/quasi aligner tools. This fixes race conditions that can happen.
    • Add --buildversion option, for tracking which version of a gene build was used. This is used during bcbio_setup_genome.py. Suggested formats are source_version, so Ensembl_94, EnsemblMetazoa_25, FlyBase_26, etc.
    • Sort MACS2 bedgraph files before compressing. Thanks to @LMannarino for the suggestion.
    • Check for the reserved field sample in RNA-seq metadata and quit with a useful error message. Thanks to @marypiper for suggesting this.
    • Split ATAC-seq BAM files into nucleosome-free and mono/di/tri nucleosome files, so we can call peaks on them separately.
    • Call peaks on NF/MN/DN/TN regions separately for each caller during ATAC-seq.
    • Allow viral contamination to be assayed on non tumor/normal samples.
    • Ensure EBV coverage is calculated when run on genomes with it included as a contig.
  • v1.1.8(Oct 29, 2019)

    • Add antibody configuration option. Setting a specific antibody for ChIP-seq will use appropriate settings for that antibody. See the documentation for supported antibodies.
    • Add use_lowfreq_filter for forcing vardict to report variants with low allelic frequency, useful for calling somatic variants in panels with high coverage.
    • Fix for checking for pre-existing inputs with python3.
    • Add keep_duplicates option for ChIP/ATAC-seq which does not remove duplicates before peak calling. Defaults to False.
    • Add keep_multimappers for ChIP/ATAC-seq which does not remove multimappers before peak calling. Defaults to False.
    • Remove ethnicity as a required column in PED files.
  • v1.1.7(Oct 11, 2019)

  • v1.1.6(Oct 10, 2019)

    • GATK ApplyBQSRSpark: avoid StreamClosed issue with GATK 4.1+
    • RNA-seq: fixes for cufflinks preparation due to python3 transition.
    • RNA-seq: output count tables from tximport for genes and transcripts. These are in bcbioRNASeq/results/date/genes/counts and bcbioRNASeq/results/data/transcripts/counts.
    • qualimap (RNA-seq): disable stranded mode for qualimap, as it gives incorrect results with the hisat2 aligner; for RNA-seq it is now simply set to unstranded.
    • Add quantify_genome_alignments option to use genome alignments to quantify with Salmon.
    • Add --validateMappings flag to Salmon read quantification mode.
    • VEP cache is not installing anymore from bcbio run
    • Add support for Salmon SA method when STAR alignments are not available (for hg38).
    • Add support for the new read model for filtering in Mutect2. This is experimental, and a little flaky, so it can optionally be turned on via: tools_on: mutect2_readmodel. Thanks to @lbeltrame for implementing this feature and doing a ton of work debugging.
    • Swap pandas from_csv call to read_csv.
    • Make STAR respect the transcriptome_gtf option.
    • Prefix regular expression with r. Thanks to @smoe for finding all of these.
    • Add informative logging messages at beginning of bcbio run. Includes the version and the configuration files being used.
    • Swap samtools mpileup to use bcftools mpileup as samtools mpileup is being deprecated (https://github.com/samtools/samtools/releases/tag/1.9).
    • Ensure locale is set to one supporting UTF-8 bcbio-wide. This may need to get reverted if it introduces issues.
    • Added hg38 support for STAR. We did this by taking hg38 and removing the alts, decoys and HLA sequences.
    • Added support for the arriba fusion caller.
    • Added back missing programs from the version provenance file. Fixed formatting problems introduced by switch to python3.
    • Added initial support for whole genome bisulfite sequencing using bismark. Thanks to @hackdna for implementing this and @jnhutchinson for drafting the initial pipeline. This is a work in progress in collaboration with @gcampanella, who has a similar implementation with some extra features that we will be merging in soon.
    • qualimap for RNA-seq runs on the downsampled BAM files by default. Set tools_on: [qualimap_full] to run on the full BAM files.
    • Add STAR junction files to the files captured at the end of a run.
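
The "prefix regular expression with r" item above matters because some escapes mean different things in ordinary and raw string literals. This is an illustrative sketch with a made-up pattern and text, not one of the expressions changed in bcbio: in an ordinary string `"\b"` is a backspace character, while in a raw string it reaches the `re` module as a word-boundary marker.

```python
import re

# Made-up pattern and text, for illustration only.
plain_pattern = "\bchr1\b"   # each "\b" is one backspace character
raw_pattern = r"\bchr1\b"    # each "\b" stays as backslash + "b": a word boundary

text = "reads on chr1 and chr10"

# The ordinary string looks for literal backspaces around "chr1" and finds none.
plain_hits = re.findall(plain_pattern, text)

# The raw string matches "chr1" as a whole word and skips "chr10".
raw_hits = re.findall(raw_pattern, text)
```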
Owner
Blue Collar Bioinformatics