Validated, scalable, community developed variant calling, RNA-seq and small RNA analysis

Overview

Validated, scalable, community developed variant calling, RNA-seq and small RNA analysis. You write a high-level configuration file specifying your inputs and analysis parameters. This input drives a parallel run that handles distributed execution, idempotent processing restarts and safe transactional steps. bcbio provides a shared community resource that handles the data processing component of sequencing analysis, providing researchers with more time to focus on the downstream biology.

Features

Quick start

  1. Install bcbio-nextgen with all tool dependencies and data files:

    wget https://raw.githubusercontent.com/bcbio/bcbio-nextgen/master/scripts/bcbio_nextgen_install.py
    python bcbio_nextgen_install.py /usr/local/share/bcbio --tooldir=/usr/local \
          --genomes hg38 --aligners bwa --aligners bowtie2

    producing an editable system configuration file referencing the installed software, data and system information.

  2. Automatically create a processing description of sample FASTQ and BAM files from your project, and a CSV file of sample metadata:

    bcbio_nextgen.py -w template freebayes-variant project1.csv sample1.bam sample2_1.fq sample2_2.fq

    This produces a sample description file containing pipeline configuration options.

  3. Run analysis, distributed across 8 local cores:

    cd project1/work
    bcbio_nextgen.py ../config/project1.yaml -n 8
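
The sample metadata CSV used in step 2 can also be generated with a short script. The sketch below assumes the layout described in the bcbio documentation: a `samplename` column matching the input file base names, a `description` column, then free-form metadata columns. The `batch` and `phenotype` values and the descriptions are hypothetical, chosen only to illustrate the shape of the file.

```python
import csv
import io


def build_sample_csv(rows):
    """Return CSV text with samplename/description plus metadata columns."""
    fieldnames = ["samplename", "description", "batch", "phenotype"]
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


# Hypothetical metadata for the two inputs named in step 2
# (sample1.bam and the sample2_1.fq / sample2_2.fq pair).
samples = [
    {"samplename": "sample1", "description": "normal1",
     "batch": "batch1", "phenotype": "normal"},
    {"samplename": "sample2", "description": "tumor1",
     "batch": "batch1", "phenotype": "tumor"},
]

csv_text = build_sample_csv(samples)
print(csv_text)
```

Writing `csv_text` to `project1.csv` would give a file of the kind the template command in step 2 takes as its metadata argument.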

Documentation

See the full documentation and longer analysis-based articles. We welcome enhancements or problem reports using GitHub and discussion on the biovalidation mailing list.

Contributors

License

The code is freely available under the MIT license.

Comments
  • torque is hanging indefinitely

    Hey there -

    In trying the updates for #386 we have killed our development install with 756be0ac - any job we try to run, be it human, rat, mouse, or the broken dogs, hangs indefinitely with torque. The nodes get checked out and the engine and clients look to be running via qstat or showq - however nothing is happening on the nodes when I look at top or ps aux. There are plenty of free nodes so this doesn't seem to be a queue issue. The jobs all hang until they hit the timeout and that's all I get. I don't see anything in the logs/ipython logs - engines appear to have started successfully... I've rubbed my eyes and wiped my work dirs a few times to no avail. I checked and indeed running -t local works.... Any suggestions or additional info I can provide?

    Thanks!

    opened by caddymob 67
  • Scalpel InDel calling support

    Looks like vcf support has been added to Scalpel recently: http://sourceforge.net/p/scalpel/code/ci/master/tree/

    Opening this ticket while I'm looking into testing Scalpel and integrating it within bcbio, bear with me

    opened by mjafin 65
  • Problems with logs and joint VCF file generation in latest dev build

    Hello,

    After upgrading to the latest development version, the logs and joint VCF file generation don't seem to work properly anymore. Debug messages don't get printed anymore (neither on stdout, nor in the log file), and the bcbio-nextgen-debug.log file is pretty much identical with bcbio-nextgen.log. The only difference is the resource requests messages which appear in the debug log:

    [2018-08-02T10:10Z] Resource requests: bwa, sambamba, samtools; memory: 3.00, 3.00, 3.00; cores: 16, 16, 16
    [2018-08-02T10:10Z] Configuring 2 jobs to run, using 16 cores each with 48.1g of memory reserved for each job
    
    [2018-08-02T10:10Z] Resource requests: gatk, gatk-haplotype, picard; memory: 3.50, 3.00, 3.00; cores: 1, 16, 16
    [2018-08-02T10:10Z] Configuring 32 jobs to run, using 1 cores each with 3.50g of memory reserved for each job
    
    [2018-08-02T10:18Z] Resource requests: bcbio_variation, fastqc, gatk, gatk-vqsr, gemini, kraken, preseq, qsignature, sambamba, samtools; memory: 3.00, 3.00, 3.50, 3.00, 3.00, 3.00, 3.00, 3.00, 3.00, 3.00; cores: 16, 16, 1, 16, 16, 16, 16, 16, 16, 16
    [2018-08-02T10:18Z] Configuring 2 jobs to run, using 16 cores each with 56.1g of memory reserved for each job
    

    The multi-sample <batch>-gatk-haplotype-joint-annotated.vcf.gz did not get generated, even though the sample-specific VCF files are where they should be.

    Furthermore, bcbio-nextgen-commands.log is completely empty.

    To test all of this, I've run a simple variant calling job that worked flawlessly a few days ago, before upgrading Bcbio-nextgen.

    opened by amizeranschi 60
  • RFC / RFE: LOH analysis in tumor-normal samples

    Given the interest in studies that involve tumor heterogeneity / subclonality, currently bcbio offers "out of the box" support for both somatic variants and CNVs. A useful metric that can be combined with these (and is already used by some tools, like CNVkit's plotting) is LOH, which (to my knowledge) is not yet handled.

    I admit I'm not sure if there is support already for this in bcbio. I know that back in the day I baked in VarScan support to actually remove LOH calls from the VCF as they weren't truly somatic calls.

    The biggest problem here is how to actually and reliably extract this information. MuTect[2] might have it in the REJECTed calls (but how to distinguish them?), VarScan 2 calls them (it might just be necessary to move them elsewhere) and I'm not sure how FreeBayes and VarDict handle them.

    Or are there any other tools more suited for this purpose?

    I'm willing to put the money where my mouth is in this case as we're starting to explore this in my institution and having bcbio do that would greatly streamline things.

    opened by lbeltrame 59
  • Trio pipeline

    @chapmanb

    1. I would like to run a trio analysis in whole exome samples. Can I use all callers (strelka2, deepvariant, vardict, gatk etc) for a trio analysis with samples having the same batch name? Can I use the ensemble method?

    2. I am also trying to do CNV analysis in this trio. Can I add all svcallers? Do they all work with a single germline sample?

    It would also be nice to specify in the documentation:

    • Which callers can be used for Germline Variant Calling
    • Which callers can only be used for Somatic (Tumor-Normal) Variant Calling
    • Which callers can be used for Germline SV Calling
    • Which callers can only be used for Somatic (Tumor-Normal) SV Calling
    • Which callers can be used for Trio analysis

    opened by kokyriakidis 56
  • canfam3 dbSNP - ensembl 75

    Greetings! Can we add the canine dbSNP vcf to the variation resources in 9dcb447, please? I realize recalibration will not be available but getting rsIDs sure would be nice :)

    The vcf can be obtained here: ftp://ftp.ensembl.org/pub/release-75/variation/vcf/canis_familiaris/Canis_familiaris.vcf.gz

    Only thing is the canine genome for bcbio has "chr" prefixes on contigs where the dbSNP does not... I seem to recall you have an ensembl <--> ucsc conversion method from when we added the rn5 genome, so hoping this is easy without just awk'in on a 'chr' :)
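
To make the contig-naming mismatch concrete, here is a minimal sketch of renaming Ensembl-style contigs to UCSC-style ones in a VCF. It is not bcbio's conversion method: the function names are made up for illustration, the only special case included is the well-known MT/chrM pair, and unplaced scaffolds would need a real mapping table rather than a blanket prefix.

```python
import re

# Assumed exceptions to the "just add chr" rule; MT -> chrM is the
# well-known one. A real conversion would load a full mapping table.
SPECIAL = {"MT": "chrM"}


def to_ucsc(contig):
    """Map an Ensembl-style contig name to a UCSC-style name."""
    if contig.startswith("chr"):
        return contig
    return SPECIAL.get(contig, "chr" + contig)


def convert_vcf_line(line):
    """Rename the contig in ##contig headers and in data lines."""
    if line.startswith("##contig=<ID="):
        return re.sub(
            r"^(##contig=<ID=)([^,>]+)",
            lambda m: m.group(1) + to_ucsc(m.group(2)),
            line,
        )
    if line.startswith("#") or not line.strip():
        return line  # other header lines and blank lines pass through
    chrom, sep, rest = line.partition("\t")
    return to_ucsc(chrom) + sep + rest


# Hypothetical record, only to show the transformation.
print(convert_vcf_line("1\t100\trs1\tA\tG\t.\t.\t.\n"), end="")
```

Renaming header and data lines together matters because downstream tools generally expect the `##contig` entries to agree with the CHROM column.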

    Thanks!

    opened by caddymob 56
  • Incorrect CNVkit output

    I’ve used the cnvkit a few times, but this particular sample results in stating everything is at a loss.

    This is the head of the T1.cns file produced by bcbio (I removed the gene column for clarity):

    chromosome | start | end | gene | log2 | baf | depth | probes | weight
    -- | -- | -- | -- | -- | -- | -- | -- | --
    chr1 | 10044 | 3783855 | removed | -1.86927 | 0.402385 | 3.47617 | 3006 | 404.754
    chr1 | 3786057 | 12808529 | removed | -2.93556 | 0.446113 | 1.56538 | 8470 | 1231.48
    chr1 | 12810479 | 14874986 | removed | -4.6436 | 0.433824 | 0.933951 | 1538 | 225.929
    chr1 | 14876201 | 16524732 | removed | -2.55335 | 0.439711 | 1.89716 | 1584 | 229.988
    chr1 | 16525961 | 16962159 | removed | -0.14769 | 0.263566 | 4.08855 | 297 | 42.2095
    chr1 | 16962525 | 46822896 | removed | -3.13869 | 0.444444 | 1.5238 | 26862 | 3916.87
    chr1 | 46824333 | 51700063 | removed | -4.49549 | 0.449153 | 0.982765 | 3712 | 547.092
    

    note that all of the log2 values are quite negative (-call.cns is similar)
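
To put those log2 values in perspective, a log2 ratio r corresponds to a copy ratio of 2^r relative to the reference. The sketch below converts the values from the table above into approximate absolute copy numbers, assuming a diploid reference and a pure tumor sample (both simplifying assumptions; `log2_to_copies` is a name made up for this illustration, not a CNVkit function).

```python
def log2_to_copies(log2_ratio, ploidy=2):
    """Approximate copy number implied by a log2 ratio, ignoring purity."""
    return ploidy * 2 ** log2_ratio


# log2 column from the bcbio-produced T1.cns rows shown above.
bcbio_log2 = [-1.86927, -2.93556, -4.6436, -2.55335,
              -0.14769, -3.13869, -4.49549]

for value in bcbio_log2:
    print(f"log2={value:9.5f}  copies~{log2_to_copies(value):.2f}")
```

Under these assumptions, any log2 value below -1 implies less than one copy, which is implausible across most of a chromosome and is consistent with a normalization problem rather than real losses.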

    and this is the result of running cnvkit manually:

    chromosome | start | end | gene | log2 | depth | probes | weight
    -- | -- | -- | -- | -- | -- | -- | --
    chr1 | 65409 | 7106529 | X | -0.06439 | 105.197 | 1521 | 522.478
    chr1 | 7107029 | 1.22E+08 | X |   |   |   |  
    chr1 | 1.22E+08 | 1.25E+08 | X | 1.19374 | 3.32145 | 16 | 7.01406
    chr1 | 1.44E+08 | 1.52E+08 | X |   |   |   |  
    chr1 | 1.52E+08 | 1.52E+08 | X | 0.595503 | 239.102 | 74 | 21.886
    chr1 | 1.52E+08 | 2.48E+08 | X |   |   |   |  
    chr1 | 2.48E+08 | 2.49E+08 | X | 0.304078 | 156.569 | 184 | 63.7607
    chr2 | 41359 | 93085490 | X |   |   |   |  
    chr2 | 94573375 | 1.79E+08 | X |   |   |   |  
    chr2 | 1.79E+08 | 1.79E+08 | X | 0.168907 | 216.203 | 540 | 173.011
    chr2 | 1.79E+08 | 2E+08 | X | 0.031598 | 92.5111 | 1717 | 642.207
    

    While some of the columns are empty, the results are much closer to accurate.

    this is the manual command, which i don't think is particularly unique and uses the bcbio generated bam files

    cnvkit.py batch final/T1/T1-ready.bam --normal final/N1/N1-ready.bam -p 8 --targets ../S04380110_Padded_hg38_trimmed.bed --fasta /mnt/biodata/genomes/Hsapiens/hg38/seq/hg38.fa --output-dir ./cnvkit/ --diagram --scatter

    opened by choosehappy 52
  • Using UMIs in the bcbio smallRNA pipeline

    Hi,

    This is somewhat similar to #2070. We have single-end .fastq files with the following format:

    @NB500965:105:HC5J5BGX2:1:11108:16467:3587 1:N:0:ATCACG
    TTCAAGTAATCCAGGATAGGAACTGTAGGCACCATCAATGACACCGAACGTAGATCGGAAAGCACACGTCTGAACT
    +
    AAAAAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAEEEEEEEEEEEEEEEAAEEE/EE

    where ATCACG is the unique sample index and AACTGTAGGCACCATCAAT is the 3' adapter.

    Following the 3' adapter is a 12 nt UMI. If I massage the .fastq file such that they are in the format:

    @NB500965:105:HC5J5BGX2:1:11108:16467:3587 1:N:0:ATCACG:UMI_GACACCGAACGTAGA
    TTCAAGTAATCCAGGATAGGAACTGTAGGCACCATCAATGACACCGAACGTAGATCGGAAAGCACACGTCTGAACT
    +
    AAAAAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAEEEEEEEEEEEEEEEAAEEE/EE
    

    am I then able to add umi_type: fastq_name to the bcbio .yaml config and run through the small RNA pipeline? Is there a better way of doing this?
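
The header rewrite described above can be sketched as follows. This is an illustration of the reformatting step only, not part of bcbio; `tag_umi` is a made-up helper. The post mentions a 12 nt UMI while the example header carries 15 bases, so the length is left as a parameter.

```python
def tag_umi(header, sequence, adapter, umi_len=12):
    """Append the bases following `adapter` to the read name as :UMI_<seq>.

    Returns None when the adapter is missing or the read is too short
    to hold a full-length UMI, leaving the decision to the caller.
    """
    start = sequence.find(adapter)
    if start == -1:
        return None
    umi_start = start + len(adapter)
    umi = sequence[umi_start:umi_start + umi_len]
    if len(umi) < umi_len:
        return None
    return f"{header}:UMI_{umi}"


# Read taken from the example in the post.
header = "@NB500965:105:HC5J5BGX2:1:11108:16467:3587 1:N:0:ATCACG"
sequence = ("TTCAAGTAATCCAGGATAGGAACTGTAGGCACCATCAAT"
            "GACACCGAACGTAGATCGGAAAGCACACGTCTGAACT")
adapter = "AACTGTAGGCACCATCAAT"

print(tag_umi(header, sequence, adapter, umi_len=15))
```

With `umi_len=15` this reproduces the header shown in the reformatted record above; the sequence and quality lines are left untouched, as in that example.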

    All advice gratefully received.

    opened by mxhp75 51
  • AssertionError multisample joint calling

    Hi Brad and group,

    Recently we ran bcbio with multisample joint calling and it's failing. When we do single sample joint calling it works. Attached are the sample yaml and bcbio error files. It's complaining about coordinates, but I am not sure how it worked for a single sample.

    Attached are the bcbio err file and sample file.

    Thanks,

    bcbio.stderr.txt sample.yaml.txt

    opened by DiyaVaka 51
  • RFC: allele fraction thresholds for paired analyses

    MuTect has a threshold setting (--tumor_f_pretest) to select sites with at least a certain fraction of non-REF alleles. Something similar is in VarScan (minimum frequency to call an allele as heterozygote). MuTect has no preset, VarScan has 0.1 by default.

    I'm wondering if (hence the RFC) this could be handled in the algorithm parameters, or at least harmonized between the two callers. Selecting a proper "frequency" (quotes, because you can't really call it frequency when you have just a sample pair) is important for validation.
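
As a rough illustration of what a harmonized threshold would do, the sketch below computes the non-REF allele fraction from read depths and applies a minimum cutoff. It is not how MuTect or VarScan implement their filters; the function names are hypothetical and the 0.1 default is simply the VarScan value quoted above.

```python
def allele_fraction(ref_depth, alt_depth):
    """Fraction of reads supporting the non-REF allele at a site."""
    total = ref_depth + alt_depth
    return alt_depth / total if total else 0.0


def passes_threshold(ref_depth, alt_depth, min_fraction=0.1):
    """True when the non-REF fraction reaches the configured minimum."""
    return allele_fraction(ref_depth, alt_depth) >= min_fraction


# Hypothetical depths: 10 of 100 reads pass at 0.1, 5 of 100 do not.
print(passes_threshold(90, 10), passes_threshold(95, 5))
```

A single `min_fraction` value exposed in the algorithm parameters could then be translated into each caller's own option.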

    Opinions? Pro, contra?

    discussion 
    opened by lbeltrame 49
  • error in bcbio structural variant calling

    Hi Brad,

    Thanks for your help. I want to call structural variants, but get an error: the parallel, svtyper, cnvnator_wrapper.py, cnvnator-multi, annotate_rd.py are not found in PATH, like this:

    [2014-10-27 23:05] Uncaught exception occurred
    Traceback (most recent call last):
      File "/public/software/bcbio-nextgen/anaconda/lib/python2.7/site-packages/bcbio/provenance/do.py", line 20, in run
        _do_run(cmd, checks, log_stdout)
      File "/public/software/bcbio-nextgen/anaconda/lib/python2.7/site-packages/bcbio/provenance/do.py", line 93, in _do_run
        raise subprocess.CalledProcessError(exitcode, error_msg)
    CalledProcessError: Command 'set -o pipefail; speedseq sv -v -B ......
    Sourcing executables from /public/software/bcbio-nextgen/tools/bin/speedseq.config ...
    which: no parallel in (/public/software/bcbio-nextgen/tools/bin:/public/software/bcbio-nextgen/anaconda/bin:.....)
    which: no svtyper in (/public/software/bcbio-nextgen/tools/bin:/public/software/bcbio....
    which: no cnvnator_wrapper.py in (/public/software/bcbio-nextgen/tools/bin:/public/software/bcbio....
    which: no cnvnator-multi in (/public/software/bcbio-nextgen/tools/bin:/public/software/bcbio-....
    which: no annotate_rd.py in((/public/software/bcbio-nextgen/tools/bin:/....)
    Calculating alignment stats...
    sambamba-view: (Broken pipe)
    Traceback (most recent call last):
      File "/public/software/bcbio-nextgen/tools/share/lumpy-sv/pairend_distro.py", line 12, in
        import numpy as np
    ImportError: No module named numpy

    How can I fix this? Thanks again.

    Shangqian

    opened by shang-qian 47
  • ValueError: Could not find directory in config for snpeff

    Version info

    • bcbio version (bcbio_nextgen.py --version): 1.2.5
    • OS name and version (lsb_release -ds): "CentOS Linux release 7.9.2009 (Core)"

    To Reproduce Exact bcbio command you have used:

    bcbio_nextgen.py ${yaml} -n 500 -t ipython -s slurm -q batch -r "t=4-00:00:00"  --timeout 4000 --retries 500 
    
    

    Your yaml configuration file:

    resources:
      bwa:
         cores: 8
         memory: 3.5G
      samtools:
         cores: 4
         memory: 3.5G
      gatk:
         jvm_opts: ['-Xms6g' , '-Xmx6g']
         memory: 16G
    
    details:
    - algorithm:
        adapters:
        - polya
        aligner: bwa
        jointcaller: gatk-haplotype-joint
        mark_duplicates: true
        realign: false
        save_diskspace: true
        trim_reads: read_through
        variantcaller: gatk-haplotype
        vcfanno: gemini
      analysis: variant2
      description: 10G
      files:
      - 10G_R1.fastq.gz
      - 10G_R2.fastq.gz
      genome_build: hg38
      metadata:
        batch: ksu
        sex: male
    - algorithm:
        adapters:
        - polya
        aligner: bwa
        jointcaller: gatk-haplotype-joint
        mark_duplicates: true
        realign: false
        save_diskspace: true
        trim_reads: read_through
        variantcaller: gatk-haplotype
        vcfanno: gemini
      analysis: variant2
      description: 10F
      files:
      - 10F_R1.fastq.gz
      - 10F_R2.fastq.gz
      genome_build: hg38
      metadata:
        batch: ksu
        sex: male 
    fc_name: '28'
    upload:
      dir: ../final
    

    Log files (could be found in work/log). Please attach (10MB max): bcbio-nextgen-commands.log and bcbio-nextgen-debug.log.

    The run works perfectly until the final annotation step, where I get the following error:

    2023-01-03 08:01:35.623 [IPClusterStart] Loaded config file: /encrypted/e3008/Azza/ksu_bcbio_fam/families/28/work/log/ipython/ipcluster_config.py
    2023-01-03 08:01:35.623 [IPClusterStart] Looking for ipcluster_config in /encrypted/e3008/Azza/ksu_bcbio_fam/families/28/work
    [2023-01-03T05:02Z] cn605-27-r: Timing: variant post-processing
    [2023-01-03T05:02Z] cn605-27-r: ipython: postprocess_variants
    [2023-01-03T05:02Z] cn514-09-l: Finalizing variant calls: 10G, gatk-haplotype
    [2023-01-03T05:02Z] cn514-09-l: Calculating variation effects for 10G, gatk-haplotype
    [2023-01-03T05:02Z] cn514-09-l: Unexpected error
    Traceback (most recent call last):
      File "/ibex/sw/csi/bcbio-nextgen/1.2.5/el7.9_python2/bcbio/anaconda/lib/python3.6/site-packages/bcbio/distributed/ipythontasks.py", line 54, in _setup_logging
        yield config
      File "/ibex/sw/csi/bcbio-nextgen/1.2.5/el7.9_python2/bcbio/anaconda/lib/python3.6/site-packages/bcbio/distributed/ipythontasks.py", line 360, in postprocess_variants
        return ipython.zip_args(apply(variation.postprocess_variants, *args))
      File "/ibex/sw/csi/bcbio-nextgen/1.2.5/el7.9_python2/bcbio/anaconda/lib/python3.6/site-packages/bcbio/distributed/ipythontasks.py", line 82, in apply
        return object(*args, **kwargs)
      File "/ibex/sw/csi/bcbio-nextgen/1.2.5/el7.9_python2/bcbio/anaconda/lib/python3.6/site-packages/bcbio/pipeline/variation.py", line 97, in postprocess_variants
        ann_vrn_file, vrn_stats = effects.add_to_vcf(data[vrn_key], data)
      File "/ibex/sw/csi/bcbio-nextgen/1.2.5/el7.9_python2/bcbio/anaconda/lib/python3.6/site-packages/bcbio/variation/effects.py", line 32, in add_to_vcf
        ann_vrn_file, stats_files = snpeff_effects(in_file, data)
      File "/ibex/sw/csi/bcbio-nextgen/1.2.5/el7.9_python2/bcbio/anaconda/lib/python3.6/site-packages/bcbio/variation/effects.py", line 298, in snpeff_effects
        return _run_snpeff(vcf_in, "vcf", data)
      File "/ibex/sw/csi/bcbio-nextgen/1.2.5/el7.9_python2/bcbio/anaconda/lib/python3.6/site-packages/bcbio/variation/effects.py", line 399, in _run_snpeff
        snpeff_db, datadir = get_db(data)
      File "/ibex/sw/csi/bcbio-nextgen/1.2.5/el7.9_python2/bcbio/anaconda/lib/python3.6/site-packages/bcbio/variation/effects.py", line 353, in get_db
        snpeff_base_dir, snpeff_db = _installed_snpeff_genome(snpeff_db, data["config"])
      File "/ibex/sw/csi/bcbio-nextgen/1.2.5/el7.9_python2/bcbio/anaconda/lib/python3.6/site-packages/bcbio/variation/effects.py", line 439, in _installed_snpeff_genome
        snpeff_config_file = os.path.join(config_utils.get_program("snpeff", config, "dir"),
      File "/ibex/sw/csi/bcbio-nextgen/1.2.5/el7.9_python2/bcbio/anaconda/lib/python3.6/site-packages/bcbio/pipeline/config_utils.py", line 193, in get_program
        return _get_program_dir(name, pconfig)
      File "/ibex/sw/csi/bcbio-nextgen/1.2.5/el7.9_python2/bcbio/anaconda/lib/python3.6/site-packages/bcbio/pipeline/config_utils.py", line 249, in _get_program_dir
        raise ValueError("Could not find directory in config for %s" % name)
    ValueError: Could not find directory in config for snpeff
    
    

    However, the snpeff database is present in the bcbio directory:

    ls ./bcbio/genomes/Hsapiens/hg38/
    bwa  config  coverage  editing  rnaseq  rtg  seq  snpeff  srnaseq  star  txtmp  validation  variation  vep  versions.csv  viral
    
    

    Thank you!

    opened by azzatha 0
  • recalibrate=true fails, Unsupported class file major version 55

    Version info

    • bcbio version: 1.2.9
    • OS name and version: Ubuntu 18.04.5 LTS

    To Reproduce Exact bcbio command you have used:

    bcbio_nextgen.py ../config/config.yaml -n 8
    

    Your yaml configuration file:

    details:
    - algorithm:
        aligner: bwa
        exclude_regions: [lcr]
        mark_duplicates: true
        recalibrate: true
        variantcaller: [mutect2, strelka2, varscan, vardict]
        variant_regions: /media/gpudrive/apps/bcbio/genomes/Hsapiens/GRCh37/coverage/capture_regions/Exome-NGv3.bed
      analysis: variant2
      description: Patient70-normal
      files:
        - normal_1.fq.gz
        - normal_2.fq.gz
      genome_build: GRCh37
      metadata:
        batch: Patient70
        phenotype: normal
    - algorithm:
        aligner: bwa
        mark_duplicates: true
        recalibrate: true
        remove_lcr: true
        variantcaller: [mutect2, strelka2, varscan, vardict]
        variant_regions: /media/gpudrive/apps/bcbio/genomes/Hsapiens/GRCh37/coverage/capture_regions/Exome-NGv3.bed
      analysis: variant2
      description: Patient70-tumor
      files:
        - tumor_1.fq.gz
        - tumor_2.fq.gz
      genome_build: GRCh37
      metadata:
        batch: Patient70
        phenotype: tumor
    upload:
        dir: ../final
    

    Log files (could be found in work/log) Here are the important parts of the log I guess

    [2022-12-25T18:24Z] GATK: BaseRecalibratorSpark
    [2022-12-25T18:25Z] 18:25:59.390 INFO  BaseRecalibratorSpark - ------------------------------------------------------------
    [2022-12-25T18:25Z] 18:25:59.391 INFO  BaseRecalibratorSpark - The Genome Analysis Toolkit (GATK) v4.2.6.1
    [2022-12-25T18:25Z] 18:25:59.391 INFO  BaseRecalibratorSpark - For support and documentation go to https://software.broadinstitute.org/gatk/
    [2022-12-25T18:25Z] 18:25:59.391 INFO  BaseRecalibratorSpark - Executing as [email protected] on Linux v4.15.0-197-generic amd64
    [2022-12-25T18:25Z] 18:25:59.391 INFO  BaseRecalibratorSpark - Java runtime: OpenJDK 64-Bit Server VM v11.0.9.1-internal+0-adhoc..src
    [2022-12-25T18:25Z] 18:25:59.392 INFO  BaseRecalibratorSpark - Start Date/Time: December 25, 2022 at 6:25:03 PM UTC
    [2022-12-25T18:25Z] 18:25:59.392 INFO  BaseRecalibratorSpark - ------------------------------------------------------------
    [2022-12-25T18:25Z] 18:25:59.392 INFO  BaseRecalibratorSpark - ------------------------------------------------------------
    [2022-12-25T18:25Z] 18:25:59.393 INFO  BaseRecalibratorSpark - HTSJDK Version: 2.24.1
    [2022-12-25T18:25Z] 18:25:59.393 INFO  BaseRecalibratorSpark - Picard Version: 2.27.1
    [2022-12-25T18:25Z] 18:25:59.393 INFO  BaseRecalibratorSpark - Built for Spark Version: 2.4.5
    
    ...
    [2022-12-25T18:36Z] java.lang.IllegalArgumentException: Unsupported class file major version 55
    [2022-12-25T18:36Z]     at org.apache.xbean.asm6.ClassReader.<init>(ClassReader.java:166)
    [2022-12-25T18:36Z]     at org.apache.xbean.asm6.ClassReader.<init>(ClassReader.java:148)
    [2022-12-25T18:36Z]     at org.apache.xbean.asm6.ClassReader.<init>(ClassReader.java:136)
    [2022-12-25T18:36Z]     at org.apache.xbean.asm6.ClassReader.<init>(ClassReader.java:237)
    [2022-12-25T18:36Z]     at org.apache.spark.util.ClosureCleaner$.getClassReader(ClosureCleaner.scala:49)
    [2022-12-25T18:36Z]     at org.apache.spark.util.FieldAccessFinder$$anon$3$$anonfun$visitMethodInsn$2.apply(ClosureCleaner.scala:517)
    [2022-12-25T18:36Z]     at org.apache.spark.util.FieldAccessFinder$$anon$3$$anonfun$visitMethodInsn$2.apply(ClosureCleaner.scala:500)
    [2022-12-25T18:36Z]     at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:733)
    [2022-12-25T18:36Z]     at scala.collection.mutable.HashMap$$anon$1$$anonfun$foreach$2.apply(HashMap.scala:134)
    [2022-12-25T18:36Z]     at scala.collection.mutable.HashMap$$anon$1$$anonfun$foreach$2.apply(HashMap.scala:134)
    [2022-12-25T18:36Z]     at scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:236)
    [2022-12-25T18:36Z]     at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:40)
    [2022-12-25T18:36Z]     at scala.collection.mutable.HashMap$$anon$1.foreach(HashMap.scala:134)
    [2022-12-25T18:36Z]     at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:732)
    [2022-12-25T18:36Z]     at org.apache.spark.util.FieldAccessFinder$$anon$3.visitMethodInsn(ClosureCleaner.scala:500)
    [2022-12-25T18:36Z]     at org.apache.xbean.asm6.ClassReader.readCode(ClassReader.java:2175)
    [2022-12-25T18:36Z]     at org.apache.xbean.asm6.ClassReader.readMethod(ClassReader.java:1238)
    [2022-12-25T18:36Z]     at org.apache.xbean.asm6.ClassReader.accept(ClassReader.java:631)
    [2022-12-25T18:36Z]     at org.apache.xbean.asm6.ClassReader.accept(ClassReader.java:355)
    [2022-12-25T18:36Z]     at org.apache.spark.util.ClosureCleaner$$anonfun$org$apache$spark$util$ClosureCleaner$$clean$14.apply(ClosureCleaner.scala:307)
    [2022-12-25T18:36Z]     at org.apache.spark.util.ClosureCleaner$$anonfun$org$apache$spark$util$ClosureCleaner$$clean$14.apply(ClosureCleaner.scala:306)
    [2022-12-25T18:36Z]     at scala.collection.immutable.List.foreach(List.scala:392)
    [2022-12-25T18:36Z]     at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:306)
    [2022-12-25T18:36Z]     at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:162)
    [2022-12-25T18:36Z]     at org.apache.spark.SparkContext.clean(SparkContext.scala:2326)
    [2022-12-25T18:36Z]     at org.apache.spark.SparkContext.runJob(SparkContext.scala:2100)
    [2022-12-25T18:36Z]     at org.apache.spark.SparkContext.runJob(SparkContext.scala:2126)
    [2022-12-25T18:36Z]     at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:990)
    [2022-12-25T18:36Z]     at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    [2022-12-25T18:36Z]     at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
    [2022-12-25T18:36Z]     at org.apache.spark.rdd.RDD.withScope(RDD.scala:385)
    [2022-12-25T18:36Z]     at org.apache.spark.rdd.RDD.collect(RDD.scala:989)
    [2022-12-25T18:36Z]     at org.apache.spark.RangePartitioner$.sketch(Partitioner.scala:309)
    [2022-12-25T18:36Z]     at org.apache.spark.RangePartitioner.<init>(Partitioner.scala:171)
    [2022-12-25T18:36Z]     at org.apache.spark.RangePartitioner.<init>(Partitioner.scala:151)
    [2022-12-25T18:36Z]     at org.apache.spark.rdd.OrderedRDDFunctions$$anonfun$sortByKey$1.apply(OrderedRDDFunctions.scala:62)
    [2022-12-25T18:36Z]     at org.apache.spark.rdd.OrderedRDDFunctions$$anonfun$sortByKey$1.apply(OrderedRDDFunctions.scala:61)
    [2022-12-25T18:36Z]     at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    [2022-12-25T18:36Z]     at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
    [2022-12-25T18:36Z]     at org.apache.spark.rdd.RDD.withScope(RDD.scala:385)
    [2022-12-25T18:36Z]     at org.apache.spark.rdd.OrderedRDDFunctions.sortByKey(OrderedRDDFunctions.scala:61)
    [2022-12-25T18:36Z]     at org.apache.spark.api.java.JavaPairRDD.sortByKey(JavaPairRDD.scala:936)
    [2022-12-25T18:36Z]     at org.broadinstitute.hellbender.utils.spark.SparkUtils.sortUsingElementsAsKeys(SparkUtils.java:165)
    [2022-12-25T18:36Z]     at org.broadinstitute.hellbender.engine.spark.datasources.ReadsSparkSink.sortSamRecordsToMatchHeader(ReadsSparkSink.java:207)
    [2022-12-25T18:36Z]     at org.broadinstitute.hellbender.engine.spark.datasources.ReadsSparkSink.writeReads(ReadsSparkSink.java:107)
    [2022-12-25T18:36Z]     at org.broadinstitute.hellbender.engine.spark.GATKSparkTool.writeReads(GATKSparkTool.java:374)
    [2022-12-25T18:36Z]     at org.broadinstitute.hellbender.engine.spark.GATKSparkTool.writeReads(GATKSparkTool.java:362)
    [2022-12-25T18:36Z]     at org.broadinstitute.hellbender.tools.spark.ApplyBQSRSpark.runTool(ApplyBQSRSpark.java:90)
    [2022-12-25T18:36Z]     at org.broadinstitute.hellbender.engine.spark.GATKSparkTool.runPipeline(GATKSparkTool.java:546)
    [2022-12-25T18:36Z]     at org.broadinstitute.hellbender.engine.spark.SparkCommandLineProgram.doWork(SparkCommandLineProgram.java:31)
    [2022-12-25T18:36Z]     at org.broadinstitute.hellbender.cmdline.CommandLineProgram.runTool(CommandLineProgram.java:140)
    [2022-12-25T18:36Z]     at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMainPostParseArgs(CommandLineProgram.java:192)
    [2022-12-25T18:36Z]     at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:211)
    [2022-12-25T18:36Z]     at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:160)
    [2022-12-25T18:36Z]     at org.broadinstitute.hellbender.Main.mainEntry(Main.java:203)
    [2022-12-25T18:36Z]     at org.broadinstitute.hellbender.Main.main(Main.java:289)
    [2022-12-25T18:36Z] 22/12/25 18:36:31 INFO ShutdownHookManager: Shutdown hook called
    Using GATK jar /pathto/bcbio/anaconda/envs/java/share/gatk4-4.2.6.1-1/gatk-package-4.2.6.1-local.jar
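
The error message itself can be decoded. Java class file major versions start at 45 for JDK 1.1 and increase by one per release, so the number in the exception identifies the Java release that compiled or is running the classes. The helper below is a small illustrative sketch, not part of GATK or bcbio.

```python
def java_release_for_class_major(major):
    """Map a class file major version to its Java release number.

    Major version 45 is JDK 1.1 and each later release adds one,
    so 52 is Java 8 and 55 is Java 11.
    """
    if major < 45:
        raise ValueError("not a valid class file major version")
    return major - 44


print(java_release_for_class_major(55))  # 11
```

This matches the OpenJDK 11.0.9.1 runtime reported in the log, while the same log reports a build for Spark 2.4.5, a release line that predates Spark's Java 11 support. That suggests, though the report does not confirm it, that the Spark-based GATK tools here need a Java 8 runtime.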
    
    opened by asalimih 0
  • Add --cloudbiolinux argument

    Fixes the following issue: https://github.com/bcbio/bcbio-nextgen/issues/3689

    The problem originated from this commit: https://github.com/bcbio/bcbio-nextgen/commit/d61e77825f46548101db9b64776269f8e96ee220

    opened by amizeranschi 0
  • [main_samview] fail to read the header from

    [main_samview] fail to read the header from "filename.sam".

    Hello, I am getting the following error when trying to run samtools on a sam file:

    [main_samview] fail to read the header from "20201032.sam".
    srun: error: node2-092: task 0: Exited with exit code 1

    But when I checked the sam file (using head) it does contain the headers, so what can be happening?

    @SQ SN:1 LN:278617202
    @SQ SN:2 LN:250202058
    @SQ SN:3 LN:226089100

    My script is as follows:

    #!/bin/bash

    #SBATCH --job-name=samtools
    #SBATCH --time=72:00:00
    #SBATCH --partition=serial
    #SBATCH --ntasks=1
    #SBATCH --mem-per-cpu=100GB
    #SBATCH [email protected]
    #SBATCH --mail-type=fail,end
    #SBATCH --error=%u.%J.err
    #SBATCH --output=%u.%J.out

    # load all modules needed for the current run
    module purge # clean the current env
    module add slurm # we always need this one

    # Activate the environment
    module add TOOLS python/miniconda-3.9
    module add bio/samtools/1.16.1/gcc/9.2.0
    source activate ngs-tools

    echo "Starting at date"
    echo "Running on hosts: $SLURM_NODELIST"
    echo "Current working directory is pwd"

    srun samtools view -bh 20201032.sam > SRR519926.bam
    samtools sort 20201032.bam > SRR519926.sorted.bam
    samtools index 20201032.sorted.bam

    # Save results and final clean up
    source deactivate

    echo "Finished at date"

    opened by gabyrudd22 0
  • Error with bcbio_setup_genome.py: AttributeError: 'Namespace' object has no attribute 'cloudbiolinux'

    Hi,

    I'm getting an error when trying to create a custom genome. Here's the command I'm running and the error it produces:

    $ bcbio_setup_genome.py -f GWHBDNW00000000.genome.fasta -g GWHBDNW00000000.gff --gff3 -i bwa seq -n GWHBDNW00000000 -b build1 --buildversion None
    Traceback (most recent call last):
      File "/data/share/bcbio_nextgen/anaconda/bin/bcbio_setup_genome.py", line 249, in <module>
        cbl = get_cloudbiolinux(args, REMOTES)
      File "/data/share/bcbio_nextgen/anaconda/lib/python3.7/site-packages/bcbio/install.py", line 807, in get_cloudbiolinux
        cloudbiolinux_remote = remotes["cloudbiolinux"] % args.cloudbiolinux
    AttributeError: 'Namespace' object has no attribute 'cloudbiolinux'
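
The failure mode in this traceback can be reproduced in isolation: an `argparse` namespace only has attributes for arguments the parser actually defines. The sketch below is a simplified stand-in, not bcbio's code. The parser, the placeholder URL, and the `"master"` default are assumptions made for illustration.

```python
import argparse

# Placeholder remote template; the real value lives in bcbio's install code.
REMOTE = "https://example.org/cloudbiolinux/%s.tar.gz"


def build_parser(with_cloudbiolinux):
    """Build a tiny parser, optionally defining --cloudbiolinux."""
    parser = argparse.ArgumentParser()
    parser.add_argument("-n", "--name")
    if with_cloudbiolinux:
        # The default value is an assumption made for this sketch.
        parser.add_argument("--cloudbiolinux", default="master")
    return parser


def remote_url(args):
    """Mirror the failing pattern: format a URL from args.cloudbiolinux."""
    return REMOTE % args.cloudbiolinux


old_args = build_parser(False).parse_args(["-n", "GWHBDNW00000000"])
try:
    remote_url(old_args)
    failed = False
except AttributeError as exc:
    failed = True
    print(exc)

new_args = build_parser(True).parse_args(["-n", "GWHBDNW00000000"])
url = remote_url(new_args)
print(url)
```

Defining the argument on the parser, with a default, is enough for the attribute lookup to succeed without callers passing anything new.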
    
    
    opened by amizeranschi 0
  • Bringing back Docker support, possibly as a replacement for the various Conda environments

    Inspired by a recent comment from @gabeng, I wanted to ask if it would be a great deal of effort to bring back Docker support and the creation of new Bcbio Docker images.

    One alternative to reviving bcbio-nextgen-vm (although perhaps more laborious) could be to have the possibility to replace Conda environments with several Docker containers in bcbio-nextgen itself, as they do for example in nf-core/sarek. Given how often Conda has been breaking bcbio installs during the last couple of years, it could be worth the effort to replace it, or at least offer the possibility of using Docker containers as an alternative. And this could also pave the way for Kubernetes support at some point.

    Here's a list of the Docker images currently on my system, after a few variant calling experiments with the above pipeline:

    $ docker image ls
    REPOSITORY                                                                 TAG                                          IMAGE ID       CREATED         SIZE
    nfcore/snpeff                                                              5.1.R64-1-1                                  0462080aa43c   2 weeks ago     1.4GB
    nfcore/vep                                                                 106.1.R64-1-1                                e5c98f96ae89   2 weeks ago     1.22GB
    quay.io/biocontainers/mulled-v2-d9e7bad0f7fbc8f4458d5c3ab7ffaaf0235b59fb   551156018e5580fb94d44632dfafbc9c27005a0e-0   5703dbdd3100   2 weeks ago     1.01GB
    quay.io/biocontainers/mulled-v2-780d630a9bb6a0ff2e7b6f730906fd703e40e98f   3bdd798e4b9aed6d3e1aaa1596c913a3eeb865cb-0   c4f4a546ff1b   3 weeks ago     1.26GB
    quay.io/biocontainers/mulled-v2-fe8faa35dbf6dc65a0f7f5d4ea12e31a79f73e40   219b6c272b25e7e642ae3ff0bf0c5c81a5135ab4-0   a3d569a08aa5   3 weeks ago     133MB
    quay.io/biocontainers/gatk4                                                4.3.0.0--py36hdfd78af_0                      0f8cc7afc8e6   7 weeks ago     966MB
    quay.io/biocontainers/bcftools                                             1.16--hfe4b78e_1                             7ec55dde74af   8 weeks ago     198MB
    quay.io/biocontainers/samtools                                             1.16.1--h6899075_1                           09cd4486af55   8 weeks ago     62MB
    quay.io/biocontainers/freebayes                                            1.3.6--hbfe0e7f_2                            9c664cb1521f   2 months ago    326MB
    quay.io/biocontainers/tiddit                                               3.3.2--py310hc2b7f4b_0                       e9c7cf6b37d7   2 months ago    350MB
    quay.io/biocontainers/multiqc                                              1.13--pyhdfd78af_0                           747595fd0a8e   2 months ago    431MB
    google/deepvariant                                                         1.4.0                                        decb60cd33cb   6 months ago    5.72GB
    quay.io/biocontainers/sra-tools                                            2.11.0--pl5321ha49a11a_3                     58aa27074b50   9 months ago    379MB
    quay.io/biocontainers/mosdepth                                             0.3.3--hdfd78af_1                            14b81386a558   10 months ago   22.5MB
    quay.io/biocontainers/fastp                                                0.23.2--h79da9fb_0                           371123966d85   12 months ago   52MB
    quay.io/biocontainers/mulled-v2-5f89fe0cd045cb1d615630b9261a1d17943a9b6a   6a9ff0e76ec016c3d0d27e0c0d362339f2d787e6-0   8bb307eced25   14 months ago   387MB
    quay.io/biocontainers/python                                               3.9--1                                       34c2b9e3810c   17 months ago   191MB
    quay.io/biocontainers/cnvkit                                               0.9.9--pyhdfd78af_0                          65c84d95fbda   18 months ago   1.12GB
    quay.io/biocontainers/tabix                                                1.11--hdfd78af_0                             171149a492ea   19 months ago   94.3MB
    quay.io/biocontainers/manta                                                1.6.0--h9ee0642_1                            0be19048fb6e   20 months ago   200MB
    quay.io/bcbio/bcbio-vc                                                     latest                                       196407441ba3   23 months ago   5.89GB
    quay.io/biocontainers/gawk                                                 5.1.0                                        1f25a9f620a3   2 years ago     38.6MB
    quay.io/biocontainers/vcftools                                             0.1.16--he513fc3_4                           edbf7b8881c0   2 years ago     48MB
    quay.io/biocontainers/fastqc                                               0.11.9--0                                    9d444341a7b2   2 years ago     531MB
    quay.io/biocontainers/bwa                                                  0.7.17--hed695b0_7                           5c6028c4ea33   2 years ago     109MB
    
    opened by amizeranschi 0
Releases(v1.2.9)
  • v1.2.9(Dec 15, 2021)

    • Fix vcf header bug: T/N SAMPLE lines are back - needed for import to SolveBio
    • add strandedness: auto for -l A option in salmon
    • report 10x more peaks in ChIP/ATAC-seq - use 0.05 qvalue
    • fix misleading RNA-seq duplicated reads statistics: thanks @sib-bcf
    • reorganize conda environments
    • snpEff 5.0
    • strandedness: auto
    • document WGBS pipeline steps
    • make --local an option, not default in bismark alignment - too slow
    • bcbioRNASeq update to 0.3.44
    • pureCN update to 2.0.1
    • octopus update to 0.7.4
    Source code(tar.gz)
    Source code(zip)
  • v1.2.8(Apr 14, 2021)

    • Set ENCODE library complexity flags properly for ChIP-seq. Thanks to @mistrm82.
    • Fix greylisted peaks not being propagated to the output directory. Thanks to @mistrm82.
    • Better error message when no sample barcodes are found for single-cell RNA-seq.
    • Better trimming for 2 wgbs kits
    • enable setting parameters for deduplicate_bismark
    • custom threading for bismark via yaml
    • reproducible WGBS user story with the data from Encode
    • During consensus peak calling, keep the highest scoring peak instead of calling the summit for the highest scoring peak and expanding the peak to 250 bases.
    • Enable consensus peak calling for broad peaks. Thanks to @mistrm82 and @yoonsquared for pointing out this was missing.
    • Re-enable ATAC-seq tests, they work now.
    • svprioritize for mm10
    • purecn_Dx.R - mutational signatures - still requires a manual update of deconstructsigs or release of it
    • make sure purecn uses sv_regions bed to call variants
    • fix misleading disambiguation fastqc read statistics (total, hg38, mm10)
    • wgbs: nebemseq kit: add --maxins 1000 and --local to bismark align
    • WGBS: sorted indexed deduplicated bam for ready.bam
    • print error message when aligner: false and hla typing is on
    • make sure that mark_duplicates is false with collapsed UMI input
    Source code(tar.gz)
    Source code(zip)
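
The consensus peak note in this release (keep the highest scoring peak among overlapping candidates) can be illustrated with a small sketch. This is a conceptual example only: the `Peak` type, the function name, and the sample coordinates are hypothetical, and it does not reproduce how bcbio performs the step.

```python
from typing import List, NamedTuple


class Peak(NamedTuple):
    chrom: str
    start: int  # 0-based, half-open interval as in BED
    end: int
    score: float


def consensus_peaks(peaks: List[Peak]) -> List[Peak]:
    """Keep the highest-scoring peak from each cluster of overlapping peaks."""
    kept = []
    best = None
    cluster_chrom = None
    cluster_end = None
    for peak in sorted(peaks, key=lambda p: (p.chrom, p.start, p.end)):
        overlaps = (
            best is not None
            and peak.chrom == cluster_chrom
            and peak.start < cluster_end
        )
        if overlaps:
            # Extend the cluster and remember the best-scoring member.
            cluster_end = max(cluster_end, peak.end)
            if peak.score > best.score:
                best = peak
        else:
            if best is not None:
                kept.append(best)
            best, cluster_chrom, cluster_end = peak, peak.chrom, peak.end
    if best is not None:
        kept.append(best)
    return kept


# Made-up peaks for illustration.
peaks = [
    Peak("chr1", 100, 300, 12.0),
    Peak("chr1", 250, 500, 40.0),
    Peak("chr1", 900, 1100, 8.0),
    Peak("chr2", 100, 300, 5.0),
]
result = consensus_peaks(peaks)
```

The kept peak retains its original coordinates, rather than being re-centered on a summit and padded to a fixed width.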
  • v1.2.7(Feb 23, 2021)

    • RNASeq: Add gene body coverage plots to multiqc report.
    • Restore ability to opt out of contamination checking via tools_off.
    • Properly invoke threading for verifybamid2.
    • Fix circular import issue when using bcbio functions outside of the main bcbio script.
    • Enable setting custom PureCN options via YAML file.
    Source code(tar.gz)
    Source code(zip)
  • v1.2.6(Feb 5, 2021)

    • RNASeq: Fail more gracefully if SummarizedExperiment object cannot be created.
    • Fixes to handle DRAGEN BAM files from the first stage of UMI processing.
    • Fix issue with double-annotating with dbSNP. Separating out somatic variant annotation into its own vcfanno configuration.
    Source code(tar.gz)
    Source code(zip)
  • v1.2.5(Jan 9, 2021)

    1.2.5 (01 January 2021)

    • Joint calling for RNA-seq variant calling requires setting jointcaller to bring it in line with the configuration options for variant calling.
    • Allow pre-aligned BAMs and gVCFs for RNA-seq joint variant calling. Thanks to @WimSpree for the feature.
    • Allow CollectSequencingArtifacts to be turned off via tools_off: [collectsequencingartifacts].
    • Fix getiterator -> iter deprecation in ElementTree. Thanks to @smoe.
    • Add SummarizedExperiment object from RNA-seq runs, a simplified version of the bcbioRNASeq object.
    • Add umi_type: dragen. This enables bcbio to run with first-pass, pre-consensus called UMI BAM files from DRAGEN.
    • Turn off inferential replicate loading when creating the gene x sample RNA-seq count matrix. This allows loading of thousands of RNA-seq samples.
    • Only make isoform to gene file from express if we have run express.
    • Allow "no consensus peaks found" as a valid endpoint of a ChIP-seq analysis.
    • Allow BCBIO_TEST_DIR environment variable to control where tests end up.
    • Collect OxoG and other sequencing artifacts due to damage.
    • Round tximport estimated counts.
    • Turn off consensus peak calling for broad peaks. Thanks to @lbeltrame and @LMannarino for diagnosing the broad-peaks-run-forever bug.
    Source code(tar.gz)
    Source code(zip)
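
The `getiterator` to `iter` change mentioned in this release follows the standard library: `Element.getiterator()` was deprecated and then removed in Python 3.9, while `Element.iter()` walks the same subtree. A minimal sketch, using a made-up XML snippet rather than anything bcbio parses:

```python
import xml.etree.ElementTree as ET

# Hypothetical XML document, for illustration only.
xml_text = "<run><sample name='a'/><group><sample name='b'/></group></run>"
root = ET.fromstring(xml_text)

# Element.iter() visits matching elements at any depth, in document order.
sample_names = [node.get("name") for node in root.iter("sample")]
```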
  • v1.2.4(Sep 21, 2020)

    1.2.4 (21 September 2020)

    • Remove deprecated --genomicsdb-use-vcf-codec option as this is now the default.
    • Add bismark output to MultiQC.
    • Fix PS genotype field from octopus to have the correct type.
    • Edit VarDict headers to report VCFv4.2, since htsjdk does not fully support VCFv4.3 yet.
    • Attempt to speed up bismark by implementing the parallelization strategy suggested here: https://github.com/FelixKrueger/Bismark/issues/96
    • Add --enumerate option to OptiType to report the top 10 calls and scores, to make it easier to decide how confident we are in a HLA call.
    • Performance improvements when HLA calling during panel sequencing. This skips running bwa-kit during the initial mapping for consensus UMI detection, greatly speeding up panel sequencing runs.
    • Allow custom options to be passed to featureCounts.
    • Fix race condition when running tests.
    • Add TOPMed as a datatarget.
    • Add predicted transcript and peptide output to arriba.
    • Add mm10 as a supported genome for arriba.
    • Skip bcbioRNASeq for more than 100 samples.
    • Add rRNA_pseudogene as a rRNA biotype.
    • Add --genomicsdb-use-vcf-codec when running GenotypeGVCF. See https://gatk.broadinstitute.org/hc/en-us/articles/360040509751-GenotypeGVCFs#--genomicsdb-use-vcf-codec for a discussion. Thanks to @amizeranschi for finding the issue and posting the solution.
    • update VEP to v100
    • Add consensus peak calling using https://bedops.readthedocs.io/en/latest/content/usage-examples/master-list.html to collapse overlapping peaks.
    • Pre-filter consensus peaks by removing peaks with FDR > 0.05 before performing consensus peak calling.
    • Add support for Qiagen's Qiaseq UPX 3' transcriptome kit for DGE. Support for 96 and 384 well configurations by specifying umi_type: qiagen-upx-96 or umi_type: qiagen-upx-384.
    • Add consensus peak counting using featureCounts.
    • Skip using autosomal-reference when calling ataqv for mouse/human, as this has a problem with ataqv (see https://github.com/ParkerLab/ataqv/issues/10 for discussion and followup).
    • Add pre-generated ataqv HTML report to upload directory.
    • Support single-end reads for ATAC-seq.
    • Move featureCount output files to featureCounts directory in project directory.
    • Remove RNA and reads in peak stats from MultiQC table when they are not calculated for a pipeline.
    • Only show somatic variant counts in the general stats table, if germline variants are calculated.
    • Add kit parameter for setting options for pipelines via just listing the kit. Currently only implemented for WGBS.
    Source code(tar.gz)
    Source code(zip)
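
The FDR pre-filter described in this release (drop peaks with FDR > 0.05 before consensus calling) can be sketched as below. It assumes the input is in ENCODE narrowPeak format, where the ninth column stores the q-value as -log10(q). The function name and the sample lines are hypothetical, and this is not bcbio's own implementation.

```python
import math
from typing import Iterable, List

QVALUE_COLUMN = 8  # 0-based index of the -log10(q-value) column in narrowPeak


def filter_peaks_by_fdr(lines: Iterable[str], max_fdr: float = 0.05) -> List[str]:
    """Return narrowPeak lines whose q-value is at most max_fdr."""
    min_neg_log10_q = -math.log10(max_fdr)
    kept = []
    for line in lines:
        # Skip blank lines and optional header lines.
        if not line.strip() or line.startswith(("#", "track")):
            continue
        fields = line.rstrip("\n").split("\t")
        # A larger -log10(q) means a smaller q-value, so keep values at or above the cutoff.
        if float(fields[QVALUE_COLUMN]) >= min_neg_log10_q:
            kept.append(line.rstrip("\n"))
    return kept


# Made-up peaks: the first has q = 10**-2.5, the second q = 10**-0.9.
example = [
    "chr1\t100\t300\tpeak_1\t250\t.\t8.1\t4.2\t2.5\t90\n",
    "chr1\t900\t1100\tpeak_2\t40\t.\t2.0\t1.5\t0.9\t75\n",
]
passing = filter_peaks_by_fdr(example)
```

Peaks whose q-value column holds -1 (meaning no q-value was assigned) fall below the cutoff and are dropped by this sketch.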
  • v1.2.3(Apr 7, 2020)

  • v1.2.2(Apr 5, 2020)

    • Fix for not properly looking up R environment variables in the base environment.
    • Remove --use-new-qual-calculator which was eliminated in GATK 4.1.5.0.
    • Ensure header is not written for a Series. In pandas 0.24.0 the default for header was changed from False to True so we have to set it explicitly now.
    • Remove unused Dockerfile. Thanks to @matthdsm.
    • ATAC-seq: Skip peak-calling on fractions with < 1000 reads.
    Source code(tar.gz)
    Source code(zip)
  • v1.2.1(Mar 25, 2020)

    • Update ChIP and ATAC bowtie2 runs to use --very-sensitive.
    • Properly pad TSS BED file for ataqv TSS enrichment metrics.
    • Skip bcbioRNASeq if there are fewer than three samples.
    • Run joint-calling with single cores to save resources.
    • Re-support PureCN.
    • Skip segments with no informative SNPs when creating the LOH VCF file from PureCN output.
    • Fix for duplicated output for mosdepth in quality control report.
    • Fix for missing rRNA statistics.
    Source code(tar.gz)
    Source code(zip)
  • v1.2.0(Feb 7, 2020)

    • Fix for bismark not being a supported aligner.
    • Run ataqv (https://github.com/ParkerLab/ataqv) to calculate additional ATAC-seq quality control metrics.
    • Workaround for some bcbioRNASeq plots failing with many samples when interesting_groups is not set.
    • Add known_fusions parameter for passing in known fusions to arriba.
    • Fix for tx2gene not working properly on some GTF files.
    • Sort MACS2 output with UNIX sort to avoid memory issues.
    • Run RiP on full peak file for ATAC-seq.
    • Run ataqv on unfiltered BAM file with the full peak file.
    • Run peddy on the population variant file, not the individual sample level file if joint calling was done.
    • Add STAR to MultiQC metrics.
    • Throw an error if STAR is run on a genome with alts.
    • Don't run bcbioRNASeq if there is only one sample. Thanks to @kmendler for the suggestion.
    • Improve arriba sensitivity by setting --peOverlapNbasesMin 10 and --alignSplicedMateMapLminOverLmate 0.5 when running STAR (see https://github.com/suhrig/arriba/issues/41).
    • Make TPM and counts files from tximport automatically.
    • Use --keepDuplicates when making the Salmon index. This keeps transcripts that are identical in the index instead of randomly choosing one. This helps when comparing to other ways of quantifying the transcripts, ensuring all of the transcripts are represented.
    • Remove unnecessary "quant" subdirectory for Salmon runs. This allows MultiQC to properly name the samples.
    • Ensure STAR log file is propagated to the upload directory.
    • Fix issue with memory not being specified properly when running bcbio_prepare_samples.py.
    • Run tximport automatically and store TPM in project/date/tpm and counts in project/date/counts.
    • Calculate ENCODE quality flags for ATAC-seq. See https://www.encodeproject.org/data-standards/terms/#library for a description of what the metrics mean.
    • Fix for command line being too long while joint genotyping thousands of samples.
    • Fix for command line being too long when running the CWL workflow with cromwell.
    Source code(tar.gz)
    Source code(zip)
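
The ENCODE library complexity metrics referenced in this release are defined on the linked ENCODE page as NRF (distinct read locations divided by total reads), PBC1 (locations with exactly one read divided by distinct locations), and PBC2 (locations with exactly one read divided by locations with exactly two reads). A minimal sketch of those ratios follows. The function name and the toy input are hypothetical, and bcbio's actual computation may differ.

```python
from collections import Counter
from typing import Dict, Hashable, Iterable


def library_complexity(read_positions: Iterable[Hashable]) -> Dict[str, float]:
    """Compute NRF, PBC1 and PBC2 from one mapped location per read."""
    counts = Counter(read_positions)
    total_reads = sum(counts.values())
    if total_reads == 0:
        raise ValueError("at least one read is required")
    distinct = len(counts)
    m1 = sum(1 for n in counts.values() if n == 1)  # locations seen exactly once
    m2 = sum(1 for n in counts.values() if n == 2)  # locations seen exactly twice
    return {
        "NRF": distinct / total_reads,
        "PBC1": m1 / distinct,
        "PBC2": m1 / m2 if m2 else float("inf"),
    }


# Toy input: six reads over four locations (two duplicated, two unique).
reads = [
    ("chr1", 100, "+"), ("chr1", 100, "+"),
    ("chr1", 250, "-"),
    ("chr2", 40, "+"),
    ("chr2", 900, "-"), ("chr2", 900, "-"),
]
metrics = library_complexity(reads)
```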
  • v1.1.9(Dec 6, 2019)

    • Fix for getting the VEP cache.
    • Support Picard's new syntax for ReorderSam (REFERENCE -> SEQUENCE_DICTIONARY).
    • Remove mitochondrial reads from ChIP/ATAC-seq calling.
    • Add documentation describing ATAC-seq outputs.
    • Add ENCODE library complexity metrics for ATAC/ChIP-seq to MultiQC report (see https://www.encodeproject.org/data-standards/terms/#library for a description of the metrics)
    • Add STAR sample-specific 2-pass. This helps assign a moderate number of reads per genes. Thanks to @naumenko-sa for the initial implementation and push to get this going.
    • Index transcriptomes only once for pseudo/quasi aligner tools. This fixes race conditions that can happen.
    • Add --buildversion option, for tracking which version of a gene build was used. This is used during bcbio_setup_genome.py. Suggested formats are source_version, so Ensembl_94, EnsemblMetazoa_25, FlyBase_26, etc.
    • Sort MACS2 bedgraph files before compressing. Thanks to @LMannarino for the suggestion.
    • Check for the reserved field sample in RNA-seq metadata and quit with a useful error message. Thanks to @marypiper for suggesting this.
    • Split ATAC-seq BAM files into nucleosome-free and mono/di/tri nucleosome files, so we can call peaks on them separately.
    • Call peaks on NF/MN/DN/TN regions separately for each caller during ATAC-seq.
    • Allow viral contamination to be assayed on non tumor/normal samples.
    • Ensure EBV coverage is calculated when run on genomes with it included as a contig.
    Source code(tar.gz)
    Source code(zip)
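
The `--buildversion` note in this release suggests values of the form source_version, such as Ensembl_94, EnsemblMetazoa_25 or FlyBase_26. A small sketch of checking a value against that suggested shape is shown below. The regular expression and the function name are assumptions made for illustration; nothing here implies that bcbio validates the option this way.

```python
import re
from typing import Optional, Tuple

# Assumed pattern: an alphanumeric source name, an underscore, a numeric version.
BUILDVERSION_RE = re.compile(r"^(?P<source>[A-Za-z][A-Za-z0-9]*)_(?P<version>[0-9]+)$")


def parse_buildversion(value: str) -> Optional[Tuple[str, int]]:
    """Split a source_version string, or return None if it does not match."""
    match = BUILDVERSION_RE.match(value)
    if match is None:
        return None
    return match.group("source"), int(match.group("version"))


parsed = [parse_buildversion(v) for v in ("Ensembl_94", "EnsemblMetazoa_25", "FlyBase_26")]
unparsed = parse_buildversion("None")
```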
  • v1.1.8(Oct 29, 2019)

    • Add antibody configuration option. Setting a specific antibody for ChIP-seq will use appropriate settings for that antibody. See the documentation for supported antibodies.
    • Add use_lowfreq_filter for forcing vardict to report variants with low allelic frequency, useful for calling somatic variants in panels with high coverage.
    • Fix for checking for pre-existing inputs with python3.
    • Add keep_duplicates option for ChIP/ATAC-seq which does not remove duplicates before peak calling. Defaults to False.
    • Add keep_multimappers for ChIP/ATAC-seq which does not remove multimappers before peak calling. Defaults to False.
    • Remove ethnicity as a required column in PED files.
    Source code(tar.gz)
    Source code(zip)
  • v1.1.7(Oct 11, 2019)

  • v1.1.6(Oct 10, 2019)

    • GATK ApplyBQSRSpark: avoid StreamClosed issue with GATK 4.1+
    • RNA-seq: fixes for cufflinks preparation due to python3 transition.
    • RNA-seq: output count tables from tximport for genes and transcripts. These are in bcbioRNASeq/results/date/genes/counts and bcbioRNASeq/results/data/transcripts/counts.
    • qualimap (RNA-seq): disable stranded mode for qualimap, as it gives incorrect results with the hisat2 aligner; for RNA-seq it is now always set to unstranded.
    • Add quantify_genome_alignments option to use genome alignments to quantify with Salmon.
    • Add --validateMappings flag to Salmon read quantification mode.
    • The VEP cache is no longer installed during a bcbio run.
    • Add support for Salmon SA method when STAR alignments are not available (for hg38).
    • Add support for the new read model for filtering in Mutect2. This is experimental, and a little flaky, so it can optionally be turned on via: tools_on: mutect2_readmodel. Thanks to @lbeltrame for implementing this feature and doing a ton of work debugging.
    • Swap pandas from_csv call to read_csv.
    • Make STAR respect the transcriptome_gtf option.
    • Prefix regular expression with r. Thanks to @smoe for finding all of these.
    • Add informative logging messages at beginning of bcbio run. Includes the version and the configuration files being used.
    • Swap samtools mpileup to use bcftools mpileup as samtools mpileup is being deprecated (https://github.com/samtools/samtools/releases/tag/1.9).
    • Ensure locale is set to one supporting UTF-8 bcbio-wide. This may need to get reverted if it introduces issues.
    • Added hg38 support for STAR. We did this by taking hg38 and removing the alts, decoys and HLA sequences.
    • Added support for the arriba fusion caller.
    • Added back missing programs from the version provenance file. Fixed formatting problems introduced by switch to python3.
    • Added initial support for whole genome bisulfite sequencing using bismark. Thanks to @hackdna for implementing this and @jnhutchinson for drafting the initial pipeline. This is a work in progress in collaboration with @gcampanella, who has a similar implementation with some extra features that we will be merging in soon.
    • qualimap for RNA-seq runs on the downsampled BAM files by default. Set tools_on: [qualimap_full] to run on the full BAM files.
    • Add STAR junction files to the files captured at the end of a run.
    Source code(tar.gz)
    Source code(zip)
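
The note about prefixing regular expressions with `r` refers to Python raw string literals. Without the prefix, sequences such as `\d` are invalid string escapes that recent Python versions warn about, so the backslashes have to be doubled. A minimal sketch, with a made-up input string:

```python
import re

# Without the r prefix, every backslash must be doubled to reach the regex engine.
escaped = "\\d+\\.\\d+"
# With the r prefix, the pattern is written exactly as the regex engine sees it.
raw = r"\d+\.\d+"
same_pattern = escaped == raw

# Hypothetical text to search, for illustration only.
version = re.search(raw, "GATK 4.1 release").group(0)
```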
Owner
Blue Collar Bioinformatics