BLEND: A Fast, Memory-Efficient, and Accurate Mechanism to Find Fuzzy Seed Matches

Last update: Dec 26, 2022

Related tags

Overview

BLEND: A Fast, Memory-Efficient, and Accurate Mechanism to Find Fuzzy Seed Matches

BLEND is a mechanism that can efficiently find fuzzy seed matches between sequences to significantly improve the performance and accuracy while reducing the memory space usage of two important applications: 1) finding overlapping reads and 2) read mapping. Finding fuzzy seed matches enable BLEND to find both 1) exact-matching seeds and 2) highly similar seeds. We integrate the BLEND mechanism into Minimap2. We make the following changes in the original Minimap2 implementation:

We enable the Minimap2 implementation so that it can find fuzzy seed matches using the BLEND mechanism as the original implementation can only find the exact-matching seeds between sequences. To this end, we change the sketch.c implementation of Minimap2 so that 1) we can generate the seeds that BLEND finds and 2) generate the hash values for seeds to find fuzzy seed matches.
We enable the Minimap2 implementation to use seeds longer than 256 bases so that it can store longer seeds when using BLEND by combining the minimizer k-mer with many neighbor k-mers (e.g., hundreds), if necessary. The current implementation of Minimap2 allocates 8-bits to store seed lengths up to 256 characters. We change this requirement in various places of the implementation (e.g., line 112 in sketch.c and line 239 in index.c) so that BLEND can use 14 bits to store seed lengths up to 16384 characters. We do this because BLEND merges many k-mers into a single seed, which may be much larger than a 256 character-long sequence.
We disable filtering out the minimizer k-mers (i.e., seeds in BLEND's case) based on their number of maximum occurence. We do this because BLEND enables generating the same hash value for similar seeds, which may lead to many hash values above the maximum threshold. We do not oppose enabling this filtering mechanism, but it requires further investigation on how to set this threshold value for different parameter settings in BLEND. Thus, filtering out the seeds that occur more than X times is a future work for BLEND so that we can define the value X without reducing the accuracy of BLEND.

Cloning the source code

Download the code from its GitHub repository:

git clone https://github.com/CMU-SAFARI/BLEND.git blend

Alternatively, if you would like to compile the SIMD-compatible version of BLEND, you can clone BLEND with its simde submodule:

git clone --recurse-submodules https://github.com/CMU-SAFARI/BLEND.git blend

Compiling from the source code

Compilation process is similar to Minimap2's compilation as also explained in more detail here. We keep the support for using the SIMD instructions that Minimap2 implements.

Before compiling BLEND:

Make sure you have a C compiler and GNU make,

To compile:

cd blend && make

To compile the SIMD-compatible version:

cd blend && make simd

If the compilation is successful, the binary called blend will be located under bin.

Usage

You can print the help message to learn how to use blend:

blend -h

Below we show how to use blend for 1) finding overlapping reads and 2) read mapping when using the default preset parameters for each use application and genome.

BLEND provides the preset parameters depending on:

The application: 1) Finding overlapping reads and 2) read mapping.
Sequencing Technology: 1) Accurate long reads (e.g., PacBio HiFi reads), 2) erroneous long reads (e.g., PacBio CLR reads), and 2) short reads (i.e., Illumina paired-end reads).
Genome: 1) Human, 2) eukaryotic, and 3) bacterial genomes.

Finding Overlapping Reads

Assume that you would like to perform all-vs-all overlapping between all pairs of HiFi reads from a human genome located in file reads.fastq. To find overlapping reads and store them in the PAF file output.paf:

blend -x ava-hifi --genome human reads.fastq reads.fastq > output.paf

Read Mapping

Assume that you would like to map PacBio CLR reads in file reads.fastq to a reference genome in file ref.fasta. To generate the read mapping with the CIGAR output in the SAM file output.sam:

blend -ax map-pb ref.fasta reads.fastq > output.sam

Getting Help

Since we integrate the BLEND mechanism into Minimap2, most portion of the parameters are the same as explained in the man page of Minimap2 or as explained in the public page of minimap2.1, which is subject to change as the new versions of Minamp2 role out. We explain the parameters unique to the BLEND implementation below.

The following option (i.e., neighbors) defines the number of consecutive k-mers that BLEND uses to generate a seed. Thus, if the k-mer length is k, the seed length is neighbors + k - 1. Default value is 10.

--neighbors INT Combines INT amount of k-mers to generate a seed. [10]

The following option (i.e., fixed-bits) defines the number of bits that BLEND uses for a hash value of a seed. By default, it uses 2 bits per character of a k-mer and, thus, 2*k bits for a hash value of a seed. This value can be decreased to increase the collision rate for assigning the same hash values for similar seeds, but also may start assigning the same hash value for slightly dissimilar seeds.

--fixed-bits INT BLEND uses INT number of bits when generating hash values of seeds rather than using 2*k number of bits. Useful when collision rate needs to be decreased than 2*k bits. Setting this option to 0 uses 2*k bits for hash values. [0].

BLEND also provides preset options. Some of these preset options also depend on the genome type as shown below:

-x map-ont (-k15 -w10 --fixed-bits=30 --neighbors=3)
-x ava-ont (-k15 -w20 --fixed-bits=30 --neighbors=3 -e0 -m100 -r2k)
-x map-pb (-Hk15 -w20 --fixed-bits=30 --neighbors=3)
-x ava-pb (-Hk19 -Xw20 --fixed-bits=32 --neighbors=3 -e0 -m100)
-x map-hifi --genome human (-k15 -w500 --fixed-bits=38 --neighbors=100 -U50,500 -g10k -A1 -B4 -O6,26 -E2,1 -s200)
-x map-hifi --genome eukaryote (-k15 -w500 --fixed-bits=30 --neighbors=5 -U50,500 -g10k -A1 -B4 -O6,26 -E2,1 -s200)
-x map-hifi --genome bacteria (-k15 -w500 --fixed-bits=30 --neighbors=3 -U50,500 -g10k -A1 -B4 -O6,26 -E2,1 -s200)
-x ava-hifi --genome human (-k15 -Xw500 --fixed-bits=38 --neighbors=10 -e0 -m100)
-x ava-hifi --genome eukaryote (-k15 -Xw500 --fixed-bits=30 --neighbors=10 -e0 -m100)
-x ava-hifi --genome bacteria (-k15 -Xw500 --fixed-bits=30 --neighbors=5 -e0 -m100)

Replicating the results in the paper

We explain how to replicate the results we produce in the BLEND paper in the test directory.

You might also like...

A lightweight deep network for fast and accurate optical flow estimation.

FastFlowNet: A Lightweight Network for Fast Optical Flow Estimation The official PyTorch implementation of FastFlowNet (ICRA 2021). Authors: Lingtong

161 Jan 3, 2023

A Fast and Accurate One-Stage Approach to Visual Grounding, ICCV 2019 (Oral)

One-Stage Visual Grounding ***** New: Our recent work on One-stage VG is available at ReSC.***** A Fast and Accurate One-Stage Approach to Visual Grou

118 Dec 5, 2022

Code for ACM MM2021 paper "Complementary Trilateral Decoder for Fast and Accurate Salient Object Detection"

CTDNet The PyTorch code for ACM MM2021 paper "Complementary Trilateral Decoder for Fast and Accurate Salient Object Detection" Requirements Python 3.6

28 Oct 20, 2022

Realtime segmentation with ENet, the fast and accurate segmentation net.

Enet This is a realtime segmentation net with almost 22 fps on GTX1080 ti, and the model size is very small with only 28M. This repo contains the infe

14 Aug 30, 2022

Receptive Field Block Net for Accurate and Fast Object Detection, ECCV 2018

Receptive Field Block Net for Accurate and Fast Object Detection By Songtao Liu, Di Huang, Yunhong Wang Updatas (2021/07/23): YOLOX is here!, stronger

1.4k Dec 21, 2022

Python implementation of MULTIseq barcode alignment using fuzzy string matching and GMM barcode assignment

Python implementation of MULTIseq barcode alignment using fuzzy string matching and GMM barcode assignment.

2 Feb 11, 2022

Everything you want about DP-Based Federated Learning, including Papers and Code. (Mechanism: Laplace or Gaussian, Dataset: femnist, shakespeare, mnist, cifar-10 and fashion-mnist. )

Differential Privacy (DP) Based Federated Learning (FL) Everything about DP-based FL you need is here. （所有你需要的DP-based FL的信息都在这里） Code Tip: the code o

83 Dec 24, 2022

Learnable Multi-level Frequency Decomposition and Hierarchical Attention Mechanism for Generalized Face Presentation Attack Detection

LMFD-PAD Note This is the official repository of the paper: LMFD-PAD: Learnable Multi-level Frequency Decomposition and Hierarchical Attention Mechani

28 Dec 2, 2022

Code for the TIP 2021 Paper "Salient Object Detection with Purificatory Mechanism and Structural Similarity Loss"

PurNet Project for the TIP 2021 Paper "Salient Object Detection with Purificatory Mechanism and Structural Similarity Loss" Abstract Image-based salie

4 Aug 25, 2022

Comments

A test of BLEND on two real datasets of PacBio CLR and Nanopore reads
In the paper https://arxiv.org/abs/2112.08687 BLEND was tested on only one non-HiFi read dataset. That was a simulated read dataset for one of the smallest eukaryotic genomes — the genome of Saccharomyces cerevisiae.

To test how well BLEND performs on real (non-simulated) datasets of genomes which have more typical sizes, I used it to assemble genomes from these two sets of reads:

Caenorhabditis elegans, PacBio CLR reads used in the article https://www.sciencedirect.com/science/article/pii/S2589004220305770 . For polishing I also used Illumina reads from that article. The nematode genome size is approximately 100 Mbp.

Arabidopsis thaliana, Nanopore reads https://www.ncbi.nlm.nih.gov/sra/?term=ERR5530736 . For polishing I also used Illumina reads https://www.ncbi.nlm.nih.gov/sra/?term=ERR2173372 . The size of arabidopsis' genome is approximately 120 Mbp.

I searched for overlaps, then assembled the genomes with Miniasm using default parameters, then polished the assemblies using long reads with Racon, and then polished the assemblies using both long and short reads with HyPo. The assemblies were compared with references using QUAST.

The search for overlaps was performed with Blend 1.0 and, for comparison, with Minimap 2.22, using 22 threads of Intel Xeon X5670.

For the nematode, results are as follows: | | Minimap2 | BLEND | | --- | --- | --- | | Time to find overlaps | 10m | 3h 37m | | Maximum RAM consumption | 20G | 44G | | N50 | 2,056,511 | 1,915,190 | | NGA50 | 589,675 | 563,498 | | misassemblies | 740 | 707 | | Genome fraction | 99.692% | 99.683% | | Total length | 109,516,352 | 108,958,103 |

So, the assemblies of the nematode genome made with Minimap2 and with BLEND are similar. However, Blend required 20x more time to find overlaps and 2x more RAM.

For arabidopsis Minimap found overlaps in 30 minutes using 29G RAM. I terminated BLEND because it didn't finish in 24 hours. At the moment I terminated it, BLEND was using 300G RAM.

So, it seems that on non-HiFi datasets for genomes not as small as the genome of Saccharomyces cerevisiae BLEND is slower than Minimap2 and uses more RAM. This may be so because BLEND doesn't deal efficiently with repetitive seeds.
opened by shelkmike 2
Some questions about the article
Could you please answer some questions about the article (https://arxiv.org/pdf/2112.08687.pdf):

For HiFi reads you used Minimap2 with the option --ava-pb that is intended for PacBio CLR reads and not PacBio HiFi reads (Table S1). Why didn't you try Minimap2 with some other parameters? For example you could have increased the window size and the minimizer size. I suppose this will make Minimap2 faster and decrease its RAM consumption, thus reducing the difference between BLEND and Minimap2 on HiFi reads.

Why did you use N50 and not NGA50 (Table 2)? N50 may be inflated due to misassemblies that result in improper sequence junctions.

Why did you measure k-mer completeness and average identity using unpolished assemblies (Table 3)? Miniasm assemblies require polishing, because the accuracy of its contigs is the same as the accuracy of the reads used for the assembly. The higher accuracy of BLEND in Table 3 means that contigs made with BLEND are composed of slightly more accurate reads than contigs made with Minimap2, but the difference in accuracy may disappear after polishing.

Taking into account that you used only one non-HiFi long read dataset and BLEND performed on it worse than Minimap2 (N50 in Table 2), is it correct to say that BLEND is probably fit only for HiFi long reads, and not PacBio CLR or Nanopore reads?

With best wishes, Mikhail Schelkunov
opened by shelkmike 1

Releases(v.1.0.0)

v.1.0.0(Nov 22, 2022)

The initial release of BLEND.
Source code(tar.gz)
Source code(zip)
blend-1.0.0-x86_64-Darwin.tar.gz(139.21 KB)
blend-1.0.0-x86_64-linux.tar.gz(507.82 KB)
blend-1.0.0.tar.gz(122.90 KB)

Owner

SAFARI Research Group at ETH Zurich and Carnegie Mellon University

Site for source code and tools distribution from SAFARI Research Group at ETH Zurich and Carnegie Mellon University.

GitHub Repository

Evolutionary Population Curriculum for Scaling Multi-Agent Reinforcement Learning

Evolutionary Population Curriculum for Scaling Multi-Agent Reinforcement Learning This is the code for implementing the MADDPG algorithm presented in

97 Dec 21, 2022

🔥🔥High-Performance Face Recognition Library on PaddlePaddle & PyTorch🔥🔥

face.evoLVe: High-Performance Face Recognition Library based on PaddlePaddle & PyTorch Evolve to be more comprehensive, effective and efficient for fa

3.1k Jan 02, 2023

Source code for the paper "PLOME: Pre-training with Misspelled Knowledge for Chinese Spelling Correction" in ACL2021

PLOME:Pre-training with Misspelled Knowledge for Chinese Spelling Correction (ACL2021) This repository provides the code and data of the work in ACL20

197 Nov 26, 2022

FinGAT: A Financial Graph Attention Networkto Recommend Top-K Profitable Stocks

FinGAT: A Financial Graph Attention Networkto Recommend Top-K Profitable Stocks This is our implementation for the paper: FinGAT: A Financial Graph At

64 Dec 13, 2022

PyTorch implementation of UPFlow (unsupervised optical flow learning)

UPFlow: Upsampling Pyramid for Unsupervised Optical Flow Learning By Kunming Luo, Chuan Wang, Shuaicheng Liu, Haoqiang Fan, Jue Wang, Jian Sun Megvii

87 Dec 20, 2022

This project provides a stock market environment using OpenGym with Deep Q-learning and Policy Gradient.

Stock Trading Market OpenAI Gym Environment with Deep Reinforcement Learning using Keras Overview This project provides a general environment for stoc

769 Dec 25, 2022

Explicable Reward Design for Reinforcement Learning Agents [NeurIPS'21]

3 May 12, 2022

【steal piano】GitHub偷情分析工具！

【steal piano】GitHub偷情分析工具！你是否有这样的困扰，有一天你的仓库被很多人加了star，但是你却不知道这些人都是从哪来的？别担心，GitHub偷情分析工具帮你轻松解决问题！原理 GitHub偷情分析工具透过分析star的时间以及他们之间的follow关系，可以推测出每个st

442 Dec 21, 2022

Code for reproducing experiments in "Improved Training of Wasserstein GANs"

Improved Training of Wasserstein GANs Code for reproducing experiments in "Improved Training of Wasserstein GANs". Prerequisites Python, NumPy, Tensor

2.2k Jan 01, 2023

Orthogonal Jacobian Regularization for Unsupervised Disentanglement in Image Generation (ICCV 2021)

Orthogonal Jacobian Regularization for Unsupervised Disentanglement in Image Generation Home | PyTorch BigGAN Discovery | TensorFlow ProGAN Regulariza

54 Dec 30, 2022

Sum-Product Probabilistic Language

Sum-Product Probabilistic Language SPPL is a probabilistic programming language that delivers exact solutions to a broad range of probabilistic infere

57 Nov 17, 2022

[CVPR 2021] MiVOS - Mask Propagation module. Reproduced STM (and better) with training code :star2:. Semi-supervised video object segmentation evaluation.

MiVOS (CVPR 2021) - Mask Propagation Ho Kei Cheng, Yu-Wing Tai, Chi-Keung Tang [arXiv] [Paper PDF] [Project Page] [Papers with Code] This repo impleme

106 Jan 03, 2023

BLEND: A Fast, Memory-Efficient, and Accurate Mechanism to Find Fuzzy Seed Matches

Related tags

Overview

BLEND: A Fast, Memory-Efficient, and Accurate Mechanism to Find Fuzzy Seed Matches

Cloning the source code

Compiling from the source code

Usage

Finding Overlapping Reads

Read Mapping

Getting Help

Replicating the results in the paper

You might also like...

A lightweight deep network for fast and accurate optical flow estimation.

A Fast and Accurate One-Stage Approach to Visual Grounding, ICCV 2019 (Oral)

Code for ACM MM2021 paper "Complementary Trilateral Decoder for Fast and Accurate Salient Object Detection"

Realtime segmentation with ENet, the fast and accurate segmentation net.

Receptive Field Block Net for Accurate and Fast Object Detection, ECCV 2018

Python implementation of MULTIseq barcode alignment using fuzzy string matching and GMM barcode assignment

Everything you want about DP-Based Federated Learning, including Papers and Code. (Mechanism: Laplace or Gaussian, Dataset: femnist, shakespeare, mnist, cifar-10 and fashion-mnist. )

Learnable Multi-level Frequency Decomposition and Hierarchical Attention Mechanism for Generalized Face Presentation Attack Detection

Code for the TIP 2021 Paper "Salient Object Detection with Purificatory Mechanism and Structural Similarity Loss"

Comments

A test of BLEND on two real datasets of PacBio CLR and Nanopore reads

Some questions about the article

Releases(v.1.0.0)

v.1.0.0(Nov 22, 2022)

Owner

SAFARI Research Group at ETH Zurich and Carnegie Mellon University

Evolutionary Population Curriculum for Scaling Multi-Agent Reinforcement Learning

🔥🔥High-Performance Face Recognition Library on PaddlePaddle & PyTorch🔥🔥

Source code for the paper "PLOME: Pre-training with Misspelled Knowledge for Chinese Spelling Correction" in ACL2021

FinGAT: A Financial Graph Attention Networkto Recommend Top-K Profitable Stocks

PyTorch implementation of UPFlow (unsupervised optical flow learning)

This project provides a stock market environment using OpenGym with Deep Q-learning and Policy Gradient.

Explicable Reward Design for Reinforcement Learning Agents [NeurIPS'21]

【steal piano】GitHub偷情分析工具！

Code for reproducing experiments in "Improved Training of Wasserstein GANs"

Orthogonal Jacobian Regularization for Unsupervised Disentanglement in Image Generation (ICCV 2021)

Sum-Product Probabilistic Language

[CVPR 2021] MiVOS - Mask Propagation module. Reproduced STM (and better) with training code :star2:. Semi-supervised video object segmentation evaluation.

Self-Regulated Learning for Egocentric Video Activity Anticipation

Scaling and Benchmarking Self-Supervised Visual Representation Learning

Code for the CVPR2021 paper "Patch-NetVLAD: Multi-Scale Fusion of Locally-Global Descriptors for Place Recognition"

Just Randoms Cats with python

WebUAV-3M: A Benchmark Unveiling the Power of Million-Scale Deep UAV Tracking

Co-GAIL: Learning Diverse Strategies for Human-Robot Collaboration

Code and real data for the paper "Counterfactual Temporal Point Processes", available at arXiv.

A python implementation of Deep-Image-Analogy based on pytorch.