A hifiasm fork for metagenome assembly using HiFi reads.

Getting Started

# Install hifiasm-meta (g++ and zlib required)
git clone https://github.com/xfengnefx/hifiasm-meta.git
cd hifiasm-meta && make

# Run
hifiasm_meta -t32 -o asm reads.fq.gz 2>asm.log
hifiasm_meta -t32 --force-rs -o asm reads.fq.gz 2>asm.log  # if the dataset has high redundancy

About this fork

Hifiasm_meta comes with a read selection module, which enables the assembly of highly redundant datasets without compromising overall assembly quality, and with meta-centric graph cleaning modules. It also handles chimeric read detection, contained reads, and related steps more carefully in the metagenome assembly context, which in some cases could benefit the less-represented species in the sample. We need more test samples to improve the heuristics.

Currently hifiasm_meta does not take binning info.

Output files

Contig graph: asm.p_ctg*.gfa and asm.a_ctg*.gfa

Raw unitig graph: asm.r_utg*.gfa

Cleaned unitig graph: asm.p_utg*.gfa

Contig name format: ^s[0-9]+\.[uc]tg[0-9]{6}[lc], where s[0-9]+ is the label of the disconnected subgraph that the contig belongs to. This makes it possible to quickly check whether two contigs are in the same disconnected subgraph (i.e. a haplotype that wasn't assembled into a single contig, or tangled haplotypes).
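As a minimal sketch of how the name format above could be used, the following Python snippet parses contig names and checks whether two contigs share a subgraph label. The function names are hypothetical and not part of hifiasm_meta. Reading the trailing l/c as linear/circular follows the usual hifiasm naming convention and is treated here as an assumption.

```python
import re
from collections import defaultdict

# Pattern from the README (^s[0-9]+\.[uc]tg[0-9]{6}[lc]), with named groups added.
CONTIG_NAME = re.compile(
    r"^s(?P<subgraph>[0-9]+)\.(?P<kind>[uc])tg(?P<index>[0-9]{6})(?P<shape>[lc])"
)


def parse_contig_name(name):
    """Split a contig/unitig name into its parts (hypothetical helper)."""
    m = CONTIG_NAME.match(name)
    if m is None:
        raise ValueError(f"not a hifiasm_meta-style name: {name!r}")
    return {
        "subgraph": int(m["subgraph"]),
        "kind": "unitig" if m["kind"] == "u" else "contig",
        "index": int(m["index"]),
        # Assumption: 'c' marks a circular sequence, 'l' a linear one.
        "circular": m["shape"] == "c",
    }


def same_subgraph(name_a, name_b):
    """True if both names carry the same disconnected-subgraph label."""
    return parse_contig_name(name_a)["subgraph"] == parse_contig_name(name_b)["subgraph"]


def group_by_subgraph(names):
    """Map each subgraph label to the list of names that carry it."""
    groups = defaultdict(list)
    for name in names:
        groups[parse_contig_name(name)["subgraph"]].append(name)
    return dict(groups)
```

For example, same_subgraph("s0.ctg000590l", "s0.ctg027907l") is true because both names start with s0, while a name starting with s12 would fall into a different group.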

Special Notes

Based on the limited available test data, real datasets are unlikely to require read selection; mock datasets, however, might need it.

The bin file is one-way compatible with the stable hifiasm for now: stable hifiasm can use hifiasm_meta's bin file, but not vice versa, because hifiasm_meta needs to store extra info from the overlap and error correction steps.

Switches

See also README_ha.md, the stable hifiasm doc.

# Interface
-B              Name of bin files. Allows using bin files from other
                directories.

# Read selection
-S              Enable read selection.
--force-rs      Force k-mer frequency-based read selection
                (otherwise, no selection is done if the total number
                of read overlaps looks realistic).
--lowq-10       Lower 10% quantile k-mer frequency threshold, runtime. A lower value means fewer reads are kept, if read selection is triggered. [150]

# Auxiliary
--write-paf     Dump overlaps. Produces 2 files: one contains the intra-haplotype or unphased overlaps, the other contains inter-haplotype overlaps. If coverage is very high, this might not be the full set of overlaps.
--dump-all-ovlp Dump all overlaps ever calculated during the final overlapping.
--write-ec      Dump error-corrected reads.
-e              Ban assembly, i.e. terminate before generating the string graph.

Preliminary results

We evaluated hifiasm-meta on the following public datasets:

| Dataset | Accession | #bases (Gb) | N50 read length (kb) | Median read QV | Sample description |
| --- | --- | --- | --- | --- | --- |
| ATCC | SRR11606871 | 59.2 | 12.0 | 36 | Mock, ATCC MSA-1003 |
| zymoBIOMICS | SRR13128014 | 18.0 | 10.6 | 40 | Mock, ZymoBIOMICS D6331 |
| sheepA | SRR10963010 | 51.9 | 14.3 | 25 | Sheep gut microbiome |
| sheepB | SRR14289618 | 206.4 | 11.8 | N/A* | Sheep gut microbiome |
| humanO1 | SRR15275213 | 18.5 | 11.4 | 40 | Human gut, pool of 4 omnivore samples |
| humanO2 | SRR15275212 | 15.5 | 10.3 | 41 | Human gut, pool of 4 omnivore samples |
| humanV1 | SRR15275211 | 18.8 | 11.0 | 39 | Human gut, pool of 4 vegan samples |
| humanV2 | SRR15275210 | 15.2 | 9.6 | 40 | Human gut, pool of 4 vegan samples |
| chicken | SRR15214153 | 33.6 | 17.6 | 30 | Chicken gut microbiome |

*Base quality was not available for this dataset.

For the empirical datasets, we evaluated assemblies with checkM. Following convention, we define near-complete as having a checkM completeness score above 90% and a contamination score below 5%. High-quality is defined as >70% complete and <10% contaminated. Medium-quality is defined as >50% complete and QS>50, where QS (quality score) is given by completeness-(5*contamination). Binning was performed with metabat2. Additionally, we split out any >1Mb circular contigs from genome bins and let each of them form its own bin.
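The thresholds above can be written down as a small Python sketch. The function names and the "unclassified" label are hypothetical. Assigning each MAG only to the best tier it qualifies for is an assumption made here for illustration, not something the text states. Inputs are checkM completeness and contamination in percent.

```python
def quality_score(completeness, contamination):
    """QS = completeness - 5 * contamination (both in percent)."""
    return completeness - 5.0 * contamination


def classify_mag(completeness, contamination):
    """Return the best quality tier a MAG qualifies for.

    Assumption: tiers are checked from strictest to loosest, so each MAG
    receives exactly one label.
    """
    if completeness > 90 and contamination < 5:
        return "near-complete"
    if completeness > 70 and contamination < 10:
        return "high-quality"
    if completeness > 50 and quality_score(completeness, contamination) > 50:
        return "medium-quality"
    return "unclassified"  # hypothetical label for everything else
```

For instance, a bin with 60% completeness and 1% contamination has QS = 60 - 5*1 = 55 and is medium-quality under these rules, while the same completeness with 3% contamination gives QS = 45 and misses the cutoff.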

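| Sample | >1Mb circular contigs | >1Mb circular contigs, near-complete | Near-complete MAGs | High-quality MAGs | Medium-quality MAGs |
| --- | --- | --- | --- | --- | --- |
| sheepA | 139 | 125 | 186 | 42 | 33 |
| sheepB | 245 | 219 | 377 | 55 | 47 |
| chicken | 69 | 57 | 87 | 20 | 15 |
| humanO1 | 33 | 27 | 53 | 20 | 19 |
| humanO2 | 26 | 23 | 48 | 17 | 16 |
| humanV1 | 38 | 33 | 73 | 23 | 15 |
| humanV2 | 34 | 27 | 53 | 22 | 17 |
| humanPooled | 75 | 62 | 109 | 39 | 41 |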

A Bandage plot of sheepA's primary contig graph (screenshot omitted some small unconnected contigs at the bottom):

ATCC contained 20 species and zymoBIOMICS contained 21 strains of 17 species. Hifiasm-meta recovered 14 out of 15 abundant (0.18%-18%) species in ATCC as single complete contigs. The other 5 rare species had insufficient coverage to be fully assembled. The challenge of the zymoBIOMICS dataset is its mixture of 5 E. coli strains (8% abundance each). Hifiasm-meta assembled strain B766 into a complete circular contig, strain B3008 into 2 contigs, and the rest into fragmented contigs.

The two mock datasets were assembled with --force-rs -A; the rest used the default settings. Performance on 48 threads (-t48):

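| Dataset | Wall clock (h) | PeakRSS (Gb) |
| --- | --- | --- |
| ATCC | 22 | 323 |
| zymoBIOMICS | 5.3 | 131 |
| sheepA | 17.8 | 208 |
| sheepB | 214 | 724 |
| chicken | 15.8 | 201 |
| humanO1 | 3 | 70 |
| humanO2 | 2.3 | 69 |
| humanV1 | 3.4 | 76 |
| humanV2 | 2.2 | 62 |
| humanPooled | 18 | 224 |
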
Comments
  • GFA file size issue

    Hi, I used hifiasm-meta to assemble the urogenital tract metagenomics data from CAMI.

    This data was simulated by CAMISIM, with an average read length of 3,000 bp and a read length s.d. of 1,000 bp.

    Run log:

    $ hifiasm_meta -o cami_0.hifiasm_meta.out -t 32 /database/openstack.cebitec.uni-bielefeld.de/swift/v1/CAMI_Urogenital_tract/pacbio/2018.01.23_14.08.31_sample_0/reads/anonymous_reads.fq.gz
    
    [M::hamt_assemble] Skipped read selection.
    [M::ha_analyze_count] lowest: count[16383] = 0
    [M::hamt_ft_gen::278.101*[email protected]] ==> filtered out 0 k-mers occurring 750 or more times
    [M::hamt_assemble] Generated flt tab.
    alloc 1666925 uint16_t
    [M::ha_pt_gen::398.464*4.70] ==> counted 131777689 distinct minimizer k-mers
    [M::ha_pt_gen] count[16383] = 0 (for sanity check)
    [M::ha_analyze_count] lowest: count[16383] = 0
    tot_cnt=59765
    tot_pos=59765
    [M::ha_pt_gen::431.595*5.13] ==> indexed 59765 positions
    [M::hamt_assemble::439.470*[email protected]] ==> corrected reads for round 1
    [M::hamt_assemble] # bases: 4957619989; # corrected bases: 0; # recorrected bases: 0
    [M::hamt_assemble] size of buffer: 0.132GB
    [M::ha_pt_gen::470.852*6.04] ==> counted 131777979 distinct minimizer k-mers
    [M::ha_pt_gen] count[16383] = 0 (for sanity check)
    [M::ha_analyze_count] lowest: count[16383] = 0
    tot_cnt=59765
    tot_pos=59765
    [M::ha_pt_gen::506.590*6.28] ==> indexed 59765 positions
    [M::hamt_assemble::514.866*[email protected]] ==> corrected reads for round 2
    [M::hamt_assemble] # bases: 4957619989; # corrected bases: 0; # recorrected bases: 0
    [M::hamt_assemble] size of buffer: 0.132GB
    [M::ha_pt_gen::559.852*6.81] ==> counted 131777979 distinct minimizer k-mers
    [M::ha_pt_gen] count[16383] = 0 (for sanity check)
    [M::ha_analyze_count] lowest: count[16383] = 0
    tot_cnt=59765
    tot_pos=59765
    [M::ha_pt_gen::597.090*6.98] ==> indexed 59765 positions
    [M::hamt_assemble::606.630*[email protected]] ==> corrected reads for round 3
    [M::hamt_assemble] # bases: 4957619989; # corrected bases: 0; # recorrected bases: 0
    [M::hamt_assemble] size of buffer: 0.132GB
    [M::ha_pt_gen::643.258*7.55] ==> counted 131777979 distinct minimizer k-mers
    [M::ha_pt_gen] count[16383] = 0 (for sanity check)
    [M::ha_analyze_count] lowest: count[16383] = 0
    tot_cnt=59765
    tot_pos=59765
    [M::ha_pt_gen::674.827*7.68] ==> indexed 59765 positions
    [M::hamt_assemble::683.525*[email protected]] ==> found overlaps for the final round
    [M::ha_print_ovlp_stat] # overlaps: 0
    [M::ha_print_ovlp_stat] # strong overlaps: 0
    [M::ha_print_ovlp_stat] # weak overlaps: 0
    [M::ha_print_ovlp_stat] # exact overlaps: 0
    [M::ha_print_ovlp_stat] # inexact overlaps: 0
    [M::ha_print_ovlp_stat] # overlaps without large indels: 0
    [M::ha_print_ovlp_stat] # reverse overlaps: 0
    [M::hist_readlength] <1.0k:
    [M::hist_readlength] 1.0k: ]]]]]]]]
    [M::hist_readlength] 1.5k: ]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]
    [M::hist_readlength] 2.0k: ]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]
    [M::hist_readlength] 2.5k: ]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]
    [M::hist_readlength] 3.0k: ]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]
    [M::hist_readlength] 3.5k: ]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]                                                                                                                                                                                    
    [M::hist_readlength] 4.0k: ]]]]]]]]]]]]]]]]]]]]]]]
    [M::hist_readlength] 4.5k: ]]]]]]]]]]]]]
    [M::hist_readlength] 5.0k: ]]]]]]]
    [M::hist_readlength] 5.5k: ]]]]
    [M::hist_readlength] 6.0k: ]]
    [M::hist_readlength] 6.5k: ]
    [M::hist_readlength] 7.0k: ]
    [M::hist_readlength] 7.5k: ]
    [M::hist_readlength] 8.0k: ]
    [M::hist_readlength] 8.5k: ]
    [M::hist_readlength] 9.0k: ]
    [M::hist_readlength] 9.5k: ]
    [M::hist_readlength] 10.0k: ]
    [M::hist_readlength] 10.5k: ]
    [M::hist_readlength] 11.0k: ]
    [M::hist_readlength] 11.5k: ]
    [M::hist_readlength] >50.0k: 0
    Writing reads to disk...
    wrote cmd of length 323: version=0.13-r308, CMD= hifiasm_meta -o cami_0.hifiasm_meta.out -t 32 /database/openstack.cebitec.uni-bielefeld.de/swift/v1/CAMI_Urogenital_tract/pacbio/2018.01.23_14.08.31_sample_0/reads/anonymous_reads.fq.gz
    Bin file was created on Wed Dec 30 15:31:02 2020
    Hifiasm_meta 0.1-r022 (hifiasm code base 0.13-r308).
    Reads has been written.
    [hamt::write_All_reads] Writing per-read coverage info...
    [hamt::write_All_reads] Finished writing.
    Writing ma_hit_ts to disk...
    ma_hit_ts has been written.
    Writing ma_hit_ts to disk...
    ma_hit_ts has been written.
    bin files have been written.
    Writing raw unitig GFA to disk...
    [M::hamt_output_unitig_graph_advance] Writing GFA...
    [M::hamt_output_unitig_graph_advance] Writing GFA...
    [M::hamt_output_unitig_graph_advance] Writing GFA...
    Inconsistency threshold for low-quality regions in BED files: 70%
    Writing debug asg to disk...
    [M::write_debug_assembly_graph] took 0.02s
    
    [M::main] Hifiasm code base version: 0.13-r308
    [M::main] Hifiasm_meta version: 0.1-r022
    [M::main] CMD: hifiasm_meta -o cami_0.hifiasm_meta.out -t 32 /database/openstack.cebitec.uni-bielefeld.de/swift/v1/CAMI_Urogenital_tract/pacbio/2018.01.23_14.08.31_sample_0/reads/anonymous_reads.fq.gz
    [M::main] Real time: 691.048 sec; CPU: 5463.747 sec; Peak RSS: 16.432 GB
    

    Output:

    $ ll
    .rw-r--r-- zhujie 2782    0 B  Wed Dec 30 15:31:07 2020 cami_0.hifiasm_meta.out.a_ctg.gfa
    .rw-r--r-- zhujie 2782    0 B  Wed Dec 30 15:31:07 2020 cami_0.hifiasm_meta.out.a_ctg.noseq.gfa
    .rw-r--r-- zhujie 2782    0 B  Wed Dec 30 15:31:07 2020 cami_0.hifiasm_meta.out.dbg_asg
    .rw-r--r-- zhujie 2782  1.2 GB Wed Dec 30 15:31:04 2020 cami_0.hifiasm_meta.out.ec.bin
    .rw-r--r-- zhujie 2782 38.2 MB Wed Dec 30 15:31:04 2020 cami_0.hifiasm_meta.out.ec.mt.bin
    .rw-r--r-- zhujie 2782  6.7 MB Wed Dec 30 15:31:00 2020 cami_0.hifiasm_meta.out.ovecinfo.bin
    .rw-r--r-- zhujie 2782  9.5 MB Wed Dec 30 15:31:04 2020 cami_0.hifiasm_meta.out.ovlp.reverse.bin
    .rw-r--r-- zhujie 2782  9.5 MB Wed Dec 30 15:31:04 2020 cami_0.hifiasm_meta.out.ovlp.source.bin
    .rw-r--r-- zhujie 2782    0 B  Wed Dec 30 15:31:07 2020 cami_0.hifiasm_meta.out.p_ctg.gfa
    .rw-r--r-- zhujie 2782    0 B  Wed Dec 30 15:31:07 2020 cami_0.hifiasm_meta.out.p_ctg.noseq.gfa
    .rw-r--r-- zhujie 2782    0 B  Wed Dec 30 15:31:07 2020 cami_0.hifiasm_meta.out.p_utg.gfa
    .rw-r--r-- zhujie 2782    0 B  Wed Dec 30 15:31:07 2020 cami_0.hifiasm_meta.out.p_utg.noseq.gfa
    .rw-r--r-- zhujie 2782    0 B  Wed Dec 30 15:31:07 2020 cami_0.hifiasm_meta.out.r_utg.gfa
    .rw-r--r-- zhujie 2782    0 B  Wed Dec 30 15:31:07 2020 cami_0.hifiasm_meta.out.r_utg.lowQ.bed
    .rw-r--r-- zhujie 2782    0 B  Wed Dec 30 15:31:07 2020 cami_0.hifiasm_meta.out.r_utg.noseq.gfa
    

    All GFA files have a size of zero.

    Any help? Thanks ~

    opened by alienzj 14
  • Good settings for enriched similar sequences

    Hi, we are struggling to perform de novo assembly of metagenomic bacterial samples from wastewater, selectively cultured with antimicrobials, using hifiasm-meta with the default parameters. The sequencing depth seems fine, but the number of circularized bacterial genomes and plasmids is small, so the resulting contigs are not good. We guess the cause might be the increased redundancy of sequences (bacterial species and plasmids). Does anyone know of effective settings for this kind of data? Thanks!

    opened by suzukimicro 8
  • hifiasm-meta produces redundant assemblies?

    Hello,

    I performed de novo assembly on two human faecal metagenomes sequenced with PacBio Sequel II. I tested metaFlye (2.9-b1768) and hifiasm-meta (v0.2.1). As you can see below, hifiasm-meta produces much larger assemblies.

    I mapped Illumina paired-end reads obtained from the same samples onto the PacBio assemblies. Even though the hifiasm_meta assemblies are much larger, the proportion of mapped reads increases only slightly. In addition, the proportion of reads aligned exactly 1 time is much lower. This suggests that hifiasm-meta produces redundant assemblies. What do you think?

    Thanks for your help, Florian

    Donor 1

    | | metaFlye | hifiasm_meta |
    | --- | --- | --- |
    | assembly size (bp) | 596 522 308 | 831 187 874 |
    | # contigs | 9 253 | 15 586 |
    | N50 (bp) | 164 736 | 132 052 |
    | % illumina reads aligned concordantly exactly 1 time | 50.79 | 39.45 |
    | % illumina reads aligned concordantly > 1 time | 23.50 | 38.31 |
    | % illumina reads aligned concordantly | 74.29 | 77.76 |

    Donor 2

    | | metaFlye | hifiasm_meta |
    | --- | --- | --- |
    | assembly size (bp) | 264 656 715 | 551 812 461 |
    | # contigs | 3 836 | 17 080 |
    | N50 (bp) | 243 801 | 44 732 |
    | % illumina reads aligned concordantly exactly 1 time | 55.28 | 20.34 |
    | % illumina reads aligned concordantly > 1 time | 33.15 | 74.26 |
    | % illumina reads aligned concordantly | 88.43 | 94.6 |

    opened by fplaza 6
  • No circular contigs recovered

    Hi,

    I have tested hifiasm-meta on PacBio HiFi data obtained from the fecal metagenome of a healthy human.

    Below are the library statistics:

    sum = 13017330229, n = 1646208, ave = 7907.46, largest = 21324
    N50 = 8596, n = 605631
    N60 = 7871, n = 763863
    N70 = 7149, n = 937306
    N80 = 6377, n = 1129752
    N90 = 5392, n = 1350332
    N100 = 104, n = 1646208
    

    Below are the assembly statistics (asm.p_ctg.gfa):

    sum = 831324548, n = 15560, ave = 53427.03, largest = 3704035
    N50 = 132051, n = 896
    N60 = 73324, n = 1769
    N70 = 45743, n = 3226
    N80 = 29874, n = 5501
    N90 = 19672, n = 8924
    N100 = 2682, n = 15560
    

    Unfortunately, it seems that there are no circular contigs, even though some contigs are very long (>3Mb). Here is a screenshot (image not included).

    Is there something I'm doing wrong?

    Thanks for your help, Florian

    opened by fplaza 5
  • Potential for improvement: A great test dataset here!

    This project is quite exciting, but like you mentioned in your pre-print, there is very little public training data to help optimize for this use-case.

    I'd like to point the authors to a substantially larger and more representative dataset: 11 real individual human HiFi fecal metagenomes (which are NOT pooled). They have a more realistic distribution of species (some highly abundant but many lower-abundant ones).

    PRJNA754443 11_sra_samples.csv

    Expected differences seen in this real dataset compared to the "pooled" samples used to benchmark this:

    1. These new samples have less equitable (but arguably more realistic) distributions of microbes than the pooled samples because you aren't merging multiple non-overlapping sets of high-abundance bugs; there is more of an exponential decay in abundances.
    2. These new samples would be expected to have potentially less tangled graphs, as they are less likely to contain mixtures of near-identical strains from different people in the same sample. Large numbers of closely-related genomes are less likely to be found within a given individual when evolutionary selection has taken place to limit the diversity of closely-related strains competing for the same resources/niches within the gut
    3. Overall depth is slightly lower with a median of roughly 1 million reads of 7kb length.
    4. Despite point 3, there may be more potential to capture rare microbes because these single samples have twice the effective read depth per human subject than the pooled samples which ostensibly have twice the volume of data in total.

    I've run the latest version of this assembler on these samples already, and see substantially fewer closed genomes (and overall HQ mags!) per sample than the pooled samples, as expected. I aim to do numerous more experiments with some of the recent cleaning options and potentially other (graph-aware?) binning tweaks, but I don't expect the overall picture to change much.

    I'm curious to see whether further improvements can be made given the availability of this larger corpus of individual-level human microbiome HiFi data.

    opened by GabeAl 5
  • General question regarding treatment of contained reads

    The manuscript briefly mentions that hifiasm-meta uses a new method for filtering contained reads. I'm interested in learning about this filtering mechanism. Could you please share more details of the algorithm, or point me to the appropriate place in the code? Pasting the text from your manuscript:

    Treatment of contained reads. The standard procedure to construct a string graph discards a read contained in a longer read. This may lead to an assembly gap if the contained read and the longer read actually reside on different haplotypes10. The original hifiasm patches such gaps by rescuing contained reads after graph construction. Hifiasm-meta tries to resolve the issue before graph construction instead. It retains a contained read if other reads exactly overlapping with the read are inferred to come from different haplotypes. In other words, hifiasm-meta only drops a contained read if there are no other similar haplotypes around it. This strategy often retains extra contained reads that are actually redundant. These extra reads usually lead to bubble-like subgraphs and are later removed by the bubble popping algorithm in the original hifiasm.

    I wish to understand the exact conditions / threshold values that decide whether a contained read is retained.

    Thank you.

    opened by cjain7 4
  • fail to Write GFA file

    Hi xfengnefx,

    I recently used hifiasm-meta to assemble my metagenomic HiFi data and encountered the same error twice, on two different compute clusters: the run stops suddenly at the "Writing GFA" step. Here are my two log files; the first is from a Slurm system, the second from a regular system. As a result I can't get my final contig GFA file. Could you help me figure this out? job-26237_1.err.txt nohup.out.txt I appreciate it.

    opened by lonestarling 4
  • Understanding which reads contribute to contigs

    Hi Xiaowen, I am wondering if it is possible to obtain a list of reads that contribute to each contig in the assembly?

    This seems like it would be highly valuable for metagenomics, as it can identify all reads associated with specific bacterial genomes. In addition, it would be extremely valuable for a more specific use-case I describe below.

    I am working on a problem where I am trying to assemble an endosymbiont bacteria from a larger HiFi dataset focused on the host organism. The assembly of the full dataset with hifiasm did not produce a complete bacteria contig, it was present as several smaller contigs. I attempted to re-assemble and improve the quality of these results. To accomplish this, I have:

    1. Mapped contigs from a hifiasm assembly of the full dataset to a reference of the target bacteria, to identify and extract relevant bacteria contigs.
    2. Mapped reads to those putative bacteria contigs to identify reads that are most likely target bacteria, and extract them.
    3. Performed assembly with this subset of putative bacteria reads using hifiasm-meta.

    This resulted in a complete, circular genome for the target bacteria, along with a few small tangled contigs, suggesting the approach worked pretty well. The small contigs in the new assembly are likely some combination of host reads and perhaps strain variation.

    The genome has a few frameshifts and I would like to try polishing it using only the reads that were used to build the complete bacteria contig. I have used minimap2 to align the subset of reads to this contig, and there are several short regions in which some proportion of reads map poorly (alignments are <1000 bp and they are being hard clipped >3000 bp on each side). I think these are potentially host reads. I can filter these out using samclip, but it would be helpful to know whether or not they were used to construct this contig, and therefore deserve to be excluded.

    Given metagenomic assemblies often result in several complete genomes, I think the same topic will come up. Polishing would also be desirable, but problematic read alignments would be more prevalent due to more species, shared repeats, etc. Having the ability to assign reads to particular contigs would be a tremendous help here too.

    Any advice would be greatly appreciated!

    Thanks, Dan

    opened by dportik 4
  • Hi-C integration?

    Hi, are you back-porting (up-porting? side-porting?) the Hi-C integration from hifiasm? We are sequencing some species where up to half the sample might be bacteria and fungi (the target species is a plant), and are considering using hifiasm-meta for this as the first step, and then mapping and extracting the plant-specific reads for a separate assembly with regular hifiasm. We are also getting Hi-C reads for these samples, so I wondered if Hi-C integration might be helpful for separating species in hifiasm-meta.

    Sincerely, Ole

    opened by olekto 3
  • gfa s-line

    Hello,

    Could you please explain more about the S-line of the noseq.gfa file? What do "dp" and "ts" represent, respectively?

    Thank you.

    opened by liushanlin 2
  • Is it necessary to conduct binning after assembly with HiFi reads to get MAGs?

    Hello, xfengnefx!

    With NGS shotgun reads, to get MAGs we usually assemble paired-end reads into contigs and then recover MAGs through binning.

    What I want to ask is: for HiFi reads, is it still necessary to perform binning after obtaining contigs with hifiasm-meta in order to get higher-quality MAGs?

    Thanks for your help.

    opened by ye00ye 2
  • HiFi reads: Is it better to perform assembly before taxonomic and functional identification?

    Hello

    I am a beginner and I have a question about metagenomic analysis using PacBio HiFi long reads. For short-read metagenomics, I have seen some papers suggest doing taxonomic and functional profiling after assembly to increase precision. I was wondering whether, with long reads, we can use the raw reads directly for profiling, or whether it is still better to perform assembly first.

    Thank you

    opened by PeymanDerik 4
  • redundancy of hifiasm-meta and metaflye

    Hello,

    I tested the assembly efficiency of hifiasm-meta and metaflye with a mock community (MSA 1003).

    For f5bcb58692924cb7_1 (ATCC-12228, len: 2503245 bp), hifiasm-meta produced 544 contigs; the longest one is 2387482 bp and the others are shorter than 30000 bp. When I mapped these contigs to the reference genome, I found high redundancy among them; in particular, the longest contig included many of the shorter contigs. In contrast, metaflye produced a single contig of exactly the length of the reference genome. For 5964adb8d0df4fde_1 (ATCC-33323, len: 1854273), however, hifiasm-meta produced 8 contigs, with good coverage and almost no overlap among them.

    So I want to ask: 1, why do the assembly results differ between reference genomes; 2, how should I set the parameters to get a set of contigs with low redundancy while maintaining high coverage?

    The parameters I currently use are: hifiasm_meta -t 36 --force-rs -o mock2 ../mock2.fastq.gz

    Thanks for your help.

    opened by ye00ye 7
  • Duplicate GFA links

    Hello

    It seems there are duplicate edges in the produced GFA. What is the purpose of these?

    For example, in sheepB.hifiasm-meta.a_ctg.gfa.gz we find:

    L       s0.ctg000590l   +       s0.ctg027907l   -       10632M  L1:i:29150
    ...
    L       s0.ctg027907l   +       s0.ctg000590l   -       10637M  L1:i:14435
    

    Note that the overlap lengths differ as well, which looks suspicious...

    opened by asl 1
Releases: hamtv0.3