deBGA: read alignment with de Bruijn graph-based seed and extension

Liu, Bo; Guo, Hongzhe; Brudno, Michael; Wang, Yadong

doi:10.1093/bioinformatics/btw371

Cited by 91 publications

(119 citation statements)

References 32 publications

Supporting

Mentioning

119

Contrasting


“…These sequences are then used for realignment with the succinct self-index-based BWA backtrack, in order to produce a final set of read alignments. More broadly, de Bruijn graph-based tools, such as deBGA, are marked as using k-mer-based indexes, because the nodes in a de Bruijn graph are identified and looked up by k-mers (Liu et al 2016). …”

Section: Indexing (mentioning)

confidence: 99%
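
The k-mer lookup described in the statement above can be illustrated with a minimal sketch. It assumes a plain Python dictionary as the index and a single linear reference; the function names `build_kmer_index` and `seed_hits` are hypothetical and are not taken from deBGA, whose actual index is far more compact.

```python
from collections import defaultdict


def build_kmer_index(reference: str, k: int) -> dict:
    """Map every k-mer (a de Bruijn graph node label) to its reference positions."""
    index = defaultdict(list)
    for i in range(len(reference) - k + 1):
        index[reference[i:i + k]].append(i)
    return dict(index)


def seed_hits(read: str, index: dict, k: int) -> list:
    """Return (read_offset, reference_position) pairs for read k-mers found in the index."""
    hits = []
    for j in range(len(read) - k + 1):
        for pos in index.get(read[j:j + k], []):
            hits.append((j, pos))
    return hits


# Toy data, chosen only for illustration.
reference = "ACGTACGTGA"
index = build_kmer_index(reference, 4)
hits = seed_hits("TACGT", index, 4)
```

Here a repeated k-mer such as `ACGT` maps to several reference positions, which is why seeds found this way still need an extension step to choose among candidate loci.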

“…The PRG system and deBGA do paired-end resolution in the space of individual sequences: the generated pair of sequences used for BWA realignment in PRG, and the linear reference sequences embedded in the de Bruijn graph in deBGA (Dilthey et al 2015; Liu et al 2016). A graph distance metric is used for paired-end resolution in vg, which can serve as an example of that approach, although the implementation does not currently consider the relative orientations of paired reads (E Garrison, J Sirén, AM Novak, G Hickey, JM Eizenga, ET Dawson, W Jones, OJ Buske, MF Lin, B Paten, et al, in prep.).…”

Section: Genome Graphs (mentioning)

confidence: 99%


Genome graphs and the evolution of genome inference

et al. 2017


The human reference genome is part of the foundation of modern human biology and a monumental scientific achievement. However, because it excludes a great deal of common human variation, it introduces a pervasive reference bias into the field of human genomics. To reduce this bias, it makes sense to draw on representative collections of human genomes, brought together into reference cohorts. There are a number of techniques to represent and organize data gleaned from these cohorts, many using ideas implicitly or explicitly borrowed from graph-based models. Here, we survey various projects underway to build and apply these graph-based structures—which we collectively refer to as genome graphs—and discuss the improvements in read mapping, variant calling, and haplotype determination that genome graphs are expected to produce.


“…Motivated by these technical problems and existing short RNA-seq read alignment algorithms [26,33], deSALT uses a two-pass approach to align the noisy long reads (a schematic illustration is in Figure 1). In the first pass, it employs a graph-based genome index [34] to find match blocks (MBs) between the read and the reference and uses a sparse dynamic programming (SDP) approach to compose the MBs into alignment skeletons (referred to as the "alignment skeleton generation" step). All the alignment skeletons of all the reads are then integrated to comprehensively detect the exon regions (referred to as the "exon inference" step).…”

Section: Overview of the deSALT Approach (mentioning)

confidence: 99%

“…deSALT aligns input reads in three major steps as follows: 1) Alignment skeleton generation (first-pass alignment): for each of the reads, deSALT uses the RdBG-index [34] to find the maximal exact matches between the unitigs of a reference de Bruijn graph (RdBG) and the read (termed as U-MEMs) and to build one or more alignment skeletons using an SDP approach.…”

Section: Steps of the deSALT Approach (mentioning)

confidence: 99%
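
The idea of composing match blocks into an alignment skeleton can be sketched as a chaining problem. The example below is a simple quadratic-time chaining dynamic program, not deSALT's actual SDP implementation; the `(read_start, ref_start, length)` block representation, the scoring by total matched length, and the function name `chain_match_blocks` are all assumptions made for illustration.

```python
def chain_match_blocks(blocks):
    """Pick the collinear, non-overlapping chain of blocks with the largest total length.

    blocks: list of (read_start, ref_start, length) exact matches.
    Returns (total_matched_length, chain_in_read_order).
    """
    if not blocks:
        return 0, []
    blocks = sorted(blocks)             # order by read position
    n = len(blocks)
    score = [b[2] for b in blocks]      # best chain score ending at each block
    prev = [-1] * n                     # back-pointers for traceback
    for i in range(n):
        ri, gi, li = blocks[i]
        for j in range(i):
            rj, gj, lj = blocks[j]
            # Block j may precede block i only if it ends before i starts
            # on both the read and the reference.
            if rj + lj <= ri and gj + lj <= gi and score[j] + li > score[i]:
                score[i] = score[j] + li
                prev[i] = j
    best = max(range(n), key=lambda i: score[i])
    total = score[best]
    chain = []
    while best != -1:
        chain.append(blocks[best])
        best = prev[best]
    return total, chain[::-1]


# Toy match blocks: three collinear hits plus one distant, spurious hit.
total, chain = chain_match_blocks(
    [(0, 100, 20), (25, 130, 15), (10, 5000, 30), (45, 150, 10)]
)
```

In this toy input the single longest block (length 30) is discarded because it cannot be combined with the others, whereas the three shorter collinear blocks together cover 45 bases.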

deSALT: fast and accurate long transcriptomic read alignment with de Bruijn graph-based index

Liu, Zang, et al. 2019

Preprint

Self Cite


Long-read RNA sequencing (RNA-seq) is promising for transcriptomics studies; however, the alignment of long RNA-seq reads is still non-trivial due to high sequencing error rates and complicated gene structures. Herein, we propose deSALT, a tailored two-pass alignment approach, which constructs graph-based alignment skeletons to infer exons and uses them to generate spliced reference sequences to produce refined alignments. deSALT addresses several difficult technical issues, such as small exons and sequencing errors, and thereby breaks through the bottlenecks of long RNA-seq read alignment. Benchmarks demonstrate that deSALT has a greater ability to produce accurate and homogeneous full-length alignments. deSALT is available at: https://github.com/hitbc/deSALT.

Keywords: long read alignment, RNA-seq, de Bruijn graph-based index

Background

RNA sequencing (RNA-seq) has become a fundamental approach to characterize transcriptomes. It reveals precise gene structures and quantifies gene/transcript expressions [1][2][3][4][5] in various applications, such as variant calling [6], RNA editing analysis [7, 8], and gene fusion detection [9, 10]. However, current widely used short read sequencing technologies have limited read length and systematic bias from library preparation. These drawbacks limit more accurate alignment [11] and precise gene isoform analysis [12], thus creating a bottleneck for transcriptomic studies.

Two kinds of long read sequencing technologies, i.e., single molecule real time (SMRT) sequencing produced by Pacific Biosciences (PacBio) [13] and nanopore sequencing produced by Oxford Nanopore Technologies (ONT) [14], are emerging and promise to break through the bottleneck of short reads in transcriptomic analysis. Both of them enable the production of much longer reads, with mean and maximum read lengths of over ten thousand and hundreds of thousands of base pairs (bp) [15,16], respectively. With this advantage, full-length transcripts can be sequenced by single reads, which is promising for substantially improving the accuracy of gene isoform reconstruction. Furthermore, there is less systematic bias in the sequencing procedure [17], which is also beneficial to gene/transcript expression quantification.

Despite these advantages, PacBio and ONT reads have much higher sequencing error rates than short reads. For PacBio SMRT sequencing, the sequencing error rate of raw reads ("subreads") is about 10% to 20% [16]; for ONT nanopore sequencing, the sequencing error rates of 1D and 2D (also known as 1D²) reads are about 25% and 12% [18,19], respectively. PacBio SMRT platforms can produce reads of inserts (ROIs) by sequencing circular fragments multiple times to largely reduce sequencing errors. However, this technology has lower sequencing yields and reduced read lengths. Therefore, these high sequencing errors raise new technical challenges for RNA-seq data analysis.

Read alignment could be the most affected one, and the effect may not be limited to the read alignment itself since it is fundamental to many down...


“…Additionally, for genome identification of reads with an unknown origin in a metagenomics study, reads can be aligned to a de Bruijn graph that is built from multiple genomes. Recently, two standalone tools have been proposed to align short Illumina reads to de Bruijn graphs: BGREAT [10] and deBGA [11]. …”

Section: Introduction (mentioning)

confidence: 99%

BrownieAligner: accurate alignment of Illumina sequencing data to de Bruijn graphs

et al. 2018


Background: Aligning short reads to a reference genome is an important task in many genome analysis pipelines. This task is computationally more complex when the reference genome is provided in the form of a de Bruijn graph instead of a linear sequence string.

Results: We present a branch and bound alignment algorithm that uses the seed-and-extend paradigm to accurately align short Illumina reads to a graph. Given a seed, the algorithm greedily explores all branches of the tree until the optimal alignment path is found. To reduce the search space we compute upper bounds to the alignment score for each branch and discard the branch if it cannot improve the best solution found so far. Additionally, by using a two-pass alignment strategy and a higher-order Markov model, paths in the de Bruijn graph that do not represent a subsequence in the original reference genome are discarded from the search procedure.

Conclusions: BrownieAligner is applied to both synthetic and real datasets. It generally outperforms other state-of-the-art tools in terms of accuracy, while having similar runtime and memory requirements. Our results show that using the higher-order Markov model in BrownieAligner improves the accuracy, while the branch and bound algorithm reduces runtime. BrownieAligner is written in standard C++11 and released under GPL license. BrownieAligner relies on multithreading to take advantage of multi-core/multi-CPU systems. The source code is available at: https://github.com/biointec/browniealigner

Electronic supplementary material: The online version of this article (10.1186/s12859-018-2319-7) contains supplementary material, which is available to authorized users.
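
The branch and bound extension described in this abstract can be illustrated with a deliberately simplified sketch. It assumes substitution-only scoring (+1 match, -1 mismatch, no indels), a graph given as a dictionary from each k-mer to its successor k-mers, and a hypothetical function name `extend_branch_and_bound`; none of this is BrownieAligner's actual code, and the Markov-model filtering is omitted.

```python
def extend_branch_and_bound(graph, start, read_suffix, match=1, mismatch=-1):
    """Extend a seed ending at node `start` along the graph over `read_suffix`.

    graph: dict mapping a k-mer to a list of successor k-mers.
    Returns (best_score, best_path); (-inf, []) if every path dead-ends early.
    """
    best = [float("-inf"), []]

    def dfs(node, i, score, path):
        if i == len(read_suffix):
            if score > best[0]:
                best[0], best[1] = score, list(path)
            return
        # Upper bound: assume every remaining read base matches.
        if score + match * (len(read_suffix) - i) <= best[0]:
            return  # this branch cannot beat the best solution so far
        # Greedy order: try successors whose new base matches the read first.
        successors = sorted(graph.get(node, []),
                            key=lambda s: s[-1] != read_suffix[i])
        for nxt in successors:
            step = match if nxt[-1] == read_suffix[i] else mismatch
            path.append(nxt)
            dfs(nxt, i + 1, score + step, path)
            path.pop()

    dfs(start, 0, 0, [start])
    return best[0], best[1]


# Toy graph with k = 3 and one branching node.
graph = {
    "ACG": ["CGT", "CGA"],
    "CGT": ["GTA"],
    "CGA": ["GAT"],
    "GTA": [],
    "GAT": [],
}
result = extend_branch_and_bound(graph, "ACG", "TA")
```

Because matching successors are explored first, a strong lower bound is usually found early, which lets the upper-bound test discard the mismatching branch (`CGA` in this toy graph) without exploring it further.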


deBGA: read alignment with de Bruijn graph-based seed and extension

Abstract: deBGA is available at: https://github.com/hitbc/deBGA. Contact: ydwang@hit.edu.cn. Supplementary information: Supplementary data are available at Bioinformatics online.
