deBWT: parallel construction of Burrows–Wheeler Transform for large collection of genomes with de Bruijn-branch encoding

Liu, Bo; Zhu, Dixian; Wang, Yadong

doi:10.1093/bioinformatics/btw266

Cited by 12 publications

(6 citation statements)

References 31 publications

“…A straightforward way to represent a pangenome is to store unaligned genomes in a full-text index that compresses redundancies in sequences identical between individuals [8][9][10]. We may retrieve individual genomes from the index, inspect the k-mer spectrum and test the presence of k-mers using standard techniques.…”

Section: Introduction (mentioning)

confidence: 99%

The design and construction of reference pangenome graphs with minigraph

2020

The recent advances in sequencing technologies enable the assembly of individual genomes to the quality of the reference genome. How to integrate multiple genomes from the same species and make the integrated representation accessible to biologists remains an open challenge. Here, we propose a graph-based data model and associated formats to represent multiple genomes while preserving the coordinate of the linear reference genome. We implement our ideas in the minigraph toolkit and demonstrate that we can efficiently construct a pangenome graph and compactly encode tens of thousands of structural variants missing from the current reference genome.

“…In the configuration stage, programmers can easily specify the basic FindeR parameters, e.g., the BWT and FM-Index files, alphabet, FM-Index bucket width, bank number and RHU number, in the configuration file. We assume the BWT construction of the reference genomes and read pools are done in the cloud [56], [57], so that we can perform trillions of backward searches on them during all steps of genome analysis. At the beginning of compiling, the files of the BWT and FM-Index are copied into ReRAM chips and the other parameters are written into the SMC on the NVDIMM.…”

Section: System Support (mentioning)

confidence: 99%

FindeR: Accelerating FM-Index-Based Exact Pattern Matching in Genomic Sequences through ReRAM Technology

Zokaee, Zhang, Jiang

2019

2019 28th International Conference on Parallel Architectures and Compilation Techniques (PACT)

Genomics is the critical key to enabling precision medicine, ensuring global food security and enforcing wildlife conservation. The massive genomic data produced by various genome sequencing technologies presents a significant challenge for genome analysis. Because of errors from sequencing machines and genetic variations, approximate pattern matching (APM) is a must for practical genome analysis. Recent work proposes FPGA, ASIC and even process-in-memory-based accelerators to boost the APM throughput by accelerating dynamic-programming-based algorithms (e.g., Smith-Waterman). However, existing accelerators lack the efficient hardware acceleration for the exact pattern matching (EPM) that is an even more critical and essential function widely used in almost every step of genome analysis including assembly, alignment, annotation and compression. State-of-the-art genome analysis adopts the FM-Index that augments the space-efficient BWT with additional data structures permitting fast EPM operations. But the FM-Index is notorious for poor spatial locality and massive random memory accesses. In this paper, we propose a ReRAM-based process-in-memory architecture, FindeR, to enhance the FM-Index EPM search throughput in genomic sequences. We build a reliable and energy-efficient Hamming distance unit to accelerate the computing kernel of FM-Index search using commodity ReRAM chips without introducing extra CMOS logic. We further architect a full-fledged FM-Index search pipeline and improve its search throughput by lightweight scheduling on the NVDIMM. We also create a system library for programmers to invoke FindeR to perform EPMs in genome analysis. Compared to state-of-the-art accelerators, FindeR improves the FM-Index search throughput by 83% ∼ 30K× and throughput per Watt by 3.5× ∼ 42.5K×.
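
As a rough illustration of the exact pattern matching that the abstract above attributes to the FM-Index, the following is a minimal Python sketch of BWT construction and backward search. It is not the deBWT or FindeR implementation: the function names (`bwt`, `build_fm_index`, `count_occurrences`) are hypothetical, the BWT is built by naively sorting rotations, and the occurrence table is stored uncompressed, which is only practical for toy inputs.

```python
from collections import Counter


def bwt(text: str) -> str:
    """Burrows-Wheeler Transform by sorting all rotations (illustration only).

    Assumes '$' does not occur in text and sorts before every other symbol.
    """
    s = text + "$"
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rotation[-1] for rotation in rotations)


def build_fm_index(bwt_str: str):
    """Build the two tables used by backward search.

    c_table[ch] = number of symbols in the text that sort before ch.
    occ[ch][i]  = number of occurrences of ch in bwt_str[:i].
    """
    counts = Counter(bwt_str)
    c_table = {}
    total = 0
    for ch in sorted(counts):
        c_table[ch] = total
        total += counts[ch]

    occ = {ch: [0] * (len(bwt_str) + 1) for ch in counts}
    for i, ch in enumerate(bwt_str):
        for c in occ:
            occ[c][i + 1] = occ[c][i] + (1 if c == ch else 0)
    return c_table, occ


def count_occurrences(pattern: str, c_table, occ, n: int) -> int:
    """Count exact matches of pattern using backward search.

    n is the length of the BWT string (text length plus the sentinel).
    The half-open interval [lo, hi) tracks the sorted rotations that start
    with the pattern suffix processed so far.
    """
    lo, hi = 0, n
    for ch in reversed(pattern):
        if ch not in c_table:
            return 0
        lo = c_table[ch] + occ[ch][lo]
        hi = c_table[ch] + occ[ch][hi]
        if lo >= hi:
            return 0
    return hi - lo


if __name__ == "__main__":
    transformed = bwt("ACGTACGA")
    c_table, occ = build_fm_index(transformed)
    for pattern in ("ACG", "CGT", "TT"):
        print(pattern, count_occurrences(pattern, c_table, occ, len(transformed)))
```

Each step of the search reads the occurrence table at two positions that are generally far apart, which is consistent with the poor spatial locality the abstract describes.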

“…Schemes Based on Burrows-Wheeler Transform: Various works incorporate the Burrows-Wheeler Transform for more space efficiency [37,196,304,390]. 4.3.2 Grammar- and Text-Related Works: Peshkin [367] uses the notions from both graph grammars and graph compression to understand the structure of DNA and simultaneously be able to represent it compactly.…”

Section: Schemes Based On De Bruijn Graphs De Bruijn Graph (mentioning)

confidence: 99%

Survey and Taxonomy of Lossless Graph Compression and Space-Efficient Graph Representations

Besta, Hoefler

2018

Preprint

Various graphs such as web or social networks may contain up to trillions of edges. Compressing such datasets can accelerate graph processing by reducing the amount of I/O accesses and the pressure on the memory subsystem. Yet, selecting a proper compression method is challenging as there exist a plethora of techniques, algorithms, domains, and approaches in compressing graphs. To facilitate this, we present a survey and taxonomy on lossless graph compression that is the first, to the best of our knowledge, to exhaustively analyze this domain. Moreover, our survey does not only categorize existing schemes, but also explains key ideas, discusses formal underpinning in selected works, and describes the space of the existing compression schemes using three dimensions: areas of research (e.g., compressing web graphs), techniques (e.g., gap encoding), and features (e.g., whether or not a given scheme targets dynamic graphs). Our survey can be used as a guide to select the best lossless compression scheme in a given setting.
