Lossless indexing with counting de Bruijn graphs

Karasikov, Mikhail; Mustafa, Harun; Rätsch, Gunnar; Kahles, André

doi:10.1101/gr.276607.122

Cited by 19 publications

(45 citation statements)

References 51 publications


“…For compatibility with minigraph, we use GetBlunted to derive a variation graph from the De Bruijn graph [17]. As an additional evaluation, we align the test reads to the original virus reference genomes using the TCG-Aligner [33] (the basis for MetaGraph-LA) to determine reference values for alignment accuracy during our experiments. See Supplementary Table A1 for statistics about these graphs.…”

Section: Evaluation Methodology (mentioning)

confidence: 99%

“…A chain is a series of anchors that appear in the correct order with respect to the query such that each anchor can reach the subsequent anchor in the chain via graph traversal. A chain is scored more favorably if it contains more anchors and is penalized if the distances between the anchors in the query differ from their corresponding graph traversal distances [39,33,2].…”

Section: Sequence-to-graph Alignment (mentioning)

confidence: 99%
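
The chain scoring described in the statement above can be illustrated with a minimal sketch. It assumes, for simplicity, that every anchor lies on a single graph walk, so the graph traversal distance between two anchors is just the difference of their positions along that walk. The names `best_chain`, `anchor_score`, and `gap_penalty` are hypothetical and are not taken from any of the cited tools, and the quadratic loop stands in for the sparse dynamic programming a real aligner would use.

```python
from typing import List, Tuple

# (query position, position along one graph walk); the single-walk
# simplification is an assumption made for this illustration only.
Anchor = Tuple[int, int]


def best_chain(anchors: List[Anchor],
               anchor_score: int = 10,
               gap_penalty: int = 1) -> Tuple[int, List[Anchor]]:
    """Return (score, chain) of the best co-linear chain via O(n^2) DP."""
    if not anchors:
        return 0, []
    anchors = sorted(anchors)
    n = len(anchors)
    score = [anchor_score] * n   # best chain score ending at each anchor
    prev = [-1] * n              # back-pointer to the preceding anchor
    for j in range(n):
        qj, gj = anchors[j]
        for i in range(j):
            qi, gi = anchors[i]
            # Anchors must appear in the same order in query and graph.
            if qi < qj and gi < gj:
                # Penalize disagreement between query and graph distances.
                diff = abs((qj - qi) - (gj - gi))
                cand = score[i] + anchor_score - gap_penalty * diff
                if cand > score[j]:
                    score[j] = cand
                    prev[j] = i
    best = max(range(n), key=lambda i: score[i])
    best_score = score[best]
    chain = []
    while best != -1:
        chain.append(anchors[best])
        best = prev[best]
    return best_score, chain[::-1]
```

With anchors `[(0, 0), (10, 10), (20, 21), (5, 100)]`, the anchor `(5, 100)` is left out because its graph distance disagrees strongly with its query distance, while the other three form the chain.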

“…Our algorithm for computing label-consistent alignments, called MetaGraph-LA, is a seed-and-extend algorithm incorporating methods from MetaGraph-Align [32] and the TCG-Aligner [33]. Given a seed length l ≤ k and a query sequence Q, we extract each l-mer from Q and find all graph nodes with matching l-length suffixes and fetch their corresponding labels.…”

Section: Metagraph-la: Generating Anchors For Mla Via Label-consisten... (mentioning)

confidence: 99%
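
The seeding step quoted above (extract each l-mer from Q, find nodes with a matching l-length suffix, fetch their labels) can be sketched on a toy annotated k-mer set. This is an illustration under stated assumptions, not the MetaGraph-LA implementation: the graph is a plain Python dictionary from k-mer to label set, and `build_annotated_kmers` and `find_seeds` are hypothetical helper names.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple


def build_annotated_kmers(samples: Dict[str, str], k: int) -> Dict[str, Set[str]]:
    """Map each k-mer (graph node) to the labels of the samples containing it."""
    labels: Dict[str, Set[str]] = defaultdict(set)
    for label, seq in samples.items():
        for i in range(len(seq) - k + 1):
            labels[seq[i:i + k]].add(label)
    return dict(labels)


def find_seeds(query: str,
               node_labels: Dict[str, Set[str]],
               l: int) -> List[Tuple[int, str, Set[str]]]:
    """Return (query position, node, labels) for every l-mer/suffix match."""
    # Index nodes by their l-length suffix (requires l <= k).
    suffix_index: Dict[str, List[str]] = defaultdict(list)
    for node in node_labels:
        suffix_index[node[-l:]].append(node)
    seeds = []
    for pos in range(len(query) - l + 1):
        lmer = query[pos:pos + l]
        for node in sorted(suffix_index.get(lmer, [])):
            seeds.append((pos, node, node_labels[node]))
    return seeds
```

For samples `{"A": "ACGTAC", "B": "TCGTAG"}` with k = 4 and the query `"CGT"` with l = 3, the single l-mer matches the suffixes of the nodes `ACGT` (label A) and `TCGT` (label B).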

“…After seed anchoring, we extend each anchor forward and backward using the same strategy as the TCG-Aligner [33], albeit along label-consistent walks. The alignments terminate after reaching a query end, or after satisfying one of the termination criteria described by Karasikov, Mustafa, et al [33,32].…”

Section: Metagraph-la: Generating Anchors For Mla Via Label-consisten... (mentioning)

confidence: 99%


Label-guided seed-chain-extend alignment on annotated De Bruijn graphs

Mustafa

Karasikov

Rätsch

et al. 2022

Preprint

Self Cite


The amount of data stored in genomic sequence databases is growing exponentially, far exceeding traditional indexing strategies' processing capabilities. Many recent indexing methods organize sequence data into a sequence graph to succinctly represent large genomic data sets from reference genome and sequencing read set databases. These methods typically use De Bruijn graphs as the graph model or the underlying index model, with auxiliary graph annotation data structures to associate graph nodes with various metadata. Examples of metadata can include a node's source samples (called labels), genomic coordinates, expression levels, etc. An important property of these graphs is that the set of sequences spelled by graph walks is a superset of the set of input sequences. Thus, when aligning to graphs indexing samples derived from low-coverage sequencing sets, sequence information present in many target samples can compensate for missing information resulting from a lack of sequence coverage. Aligning a query against an entire sequence graph (as in traditional sequence-to-graph alignment) using state-of-the-art algorithms can be computationally intractable for graphs constructed from thousands of samples, potentially searching through many non-biological combinations of samples before converging on the best alignment. To address this problem, we propose a novel alignment strategy called multi-label alignment (MLA) and an algorithm implementing this strategy using annotated De Bruijn graphs within the MetaGraph framework, called MetaGraph-MLA. MLA extends current sequence alignment scoring models with additional label change operations for incorporating mixtures of samples into an alignment, penalizing mixtures that are dissimilar in their sequence content. To overcome disconnects in the graph that result from a lack of sequencing coverage, we further extend our graph index to utilize a variable-order De Bruijn graph and introduce node length change as an operation. In this model, traversal between nodes that share a suffix of length < k-1 acts as a proxy for inserting nodes into the graph. MetaGraph-MLA constructs an MLA of a query by chaining single-label alignments using sparse dynamic programming. We evaluate MetaGraph-MLA on simulated data against state-of-the-art sequence-to-graph aligners. We demonstrate increases in alignment lengths for simulated viral Illumina-type (by 6.5%), PacBio CLR-type (by 6.2%), and PacBio CCS-type (by 6.7%) sequencing reads, respectively, and show that the graph walks incorporated into our MLAs originate predominantly from samples of the same strain as the reads' ground-truth samples. We envision MetaGraph-MLA as a step towards establishing sequence graph tools for sequence search against a wide variety of target sequence types.
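
The property stated in the abstract, that the sequences spelled by graph walks form a superset of the input sequences, can be checked on a toy example. The sketch below is an assumption-laden illustration: it builds a node-centric De Bruijn graph as a plain dictionary and enumerates walks by brute force, which is only feasible for tiny inputs. The helper names `build_dbg` and `spelled_sequences` are hypothetical.

```python
from typing import Dict, List, Set


def build_dbg(seqs: List[str], k: int) -> Dict[str, List[str]]:
    """Node-centric De Bruijn graph: nodes are k-mers, edges are (k-1)-overlaps."""
    nodes = {s[i:i + k] for s in seqs for i in range(len(s) - k + 1)}
    return {u: sorted(v for v in nodes if u[1:] == v[:-1]) for u in nodes}


def spelled_sequences(graph: Dict[str, List[str]], length: int) -> Set[str]:
    """All strings of exactly `length` characters spelled by walks in the graph."""
    k = len(next(iter(graph)))
    if length < k:
        raise ValueError("length must be at least k")
    result: Set[str] = set()

    def extend(node: str, spelled: str) -> None:
        if len(spelled) == length:
            result.add(spelled)
            return
        for nxt in graph[node]:
            # Each step along an edge appends the last character of the next node.
            extend(nxt, spelled + nxt[-1])

    for node in graph:
        extend(node, node)
    return result


inputs = ["AACGTT", "CACGTG"]
walks = spelled_sequences(build_dbg(inputs, 3), 6)
# Both inputs are spelled, plus two recombinants that appear in neither input.
print(sorted(walks))
```

The inputs share the nodes `ACG` and `CGT`, so walks can enter from one input and leave along the other. This yields the extra sequences `AACGTG` and `CACGTT`, the kind of cross-sample combination that the abstract says an unrestricted graph search may explore.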


“…Since the time complexity of optimal sequence-to-graph alignment grows linearly with the number of edges in the graph [20,16], many approaches instead follow an approximate seed-and-extend strategy [2], which operates in four main steps: i) seed extraction, which in its simplest form involves finding all substrings with a certain length, ii) seed anchoring, finding matching nodes in the graph, iii) seed filtration, often involving clustering [9,37] or co-linear chaining [25,1,32,8] of seeds, and iv) seed extension, involving performing semi-global pairwise sequence alignment forwards and backwards from each anchored seed [28]. We will review the usage of exact seeds utilized in tools such as vg [15] and GraphAligner [37] and discuss their limitations in a high mutation-rate setting.…”

Section: Introduction (mentioning)

confidence: 99%
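
The four steps named in the statement above can be walked through in a deliberately simplified sketch. Several assumptions are made here that differ from the cited tools: the target is a linear string standing in for a graph, filtration keeps only the most populated diagonal instead of clustering or chaining, and extension is ungapped exact matching, not semi-global alignment. The function name `seed_and_extend` is hypothetical.

```python
from collections import Counter, defaultdict
from typing import Optional, Tuple


def seed_and_extend(query: str, target: str,
                    seed_len: int) -> Optional[Tuple[int, int, int]]:
    """Return (query start, target start, match length), or None if no seed hits."""
    # i) + ii) Seed extraction and anchoring: look up every query substring
    # of length seed_len in an exact-match index of the target.
    index = defaultdict(list)
    for t in range(len(target) - seed_len + 1):
        index[target[t:t + seed_len]].append(t)
    anchors = [(q, t)
               for q in range(len(query) - seed_len + 1)
               for t in index.get(query[q:q + seed_len], [])]
    if not anchors:
        return None

    # iii) Seed filtration: keep the diagonal (t - q) supported by most anchors,
    # breaking ties in favor of the smallest diagonal.
    diag_counts = Counter(t - q for q, t in anchors)
    best_diag = max(diag_counts, key=lambda d: (diag_counts[d], -d))
    q, t = min(a for a in anchors if a[1] - a[0] == best_diag)

    # iv) Seed extension: grow the match backwards and forwards while
    # characters agree (ungapped, exact; a stand-in for real alignment).
    start_q, start_t = q, t
    while start_q > 0 and start_t > 0 and query[start_q - 1] == target[start_t - 1]:
        start_q -= 1
        start_t -= 1
    end_q, end_t = q + seed_len, t + seed_len
    while end_q < len(query) and end_t < len(target) and query[end_q] == target[end_t]:
        end_q += 1
        end_t += 1
    return start_q, start_t, end_q - start_q
```

For the target `"GGGACGTTCAGGG"` and the query `"ACGTTCAT"` with `seed_len=4`, four seeds anchor on the same diagonal and the extension stops at the final mismatching character of the query. The sketch also makes the limitation noted in the statement concrete: if mutations leave no exact substring of length `seed_len` intact, step ii) finds no anchors and the query is not aligned at all.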

Aligning Distant Sequences to Graphs using Long Seed Sketches

Joudaki

Meterez

Mustafa

et al. 2022

Preprint

Self Cite


Sequence-to-graph alignment is an important step in applications such as variant genotyping, read error correction and genome assembly. When a query sequence requires a substantial number of edits to align, approximate alignment tools that follow the seed-and-extend approach require shorter seeds to get any matches. However, in large graphs with high variation, relying on a shorter seed length leads to an exponential increase in spurious matches. We propose a novel seeding approach relying on long inexact matches instead of short exact matches. We demonstrate experimentally that our approach achieves a better time-accuracy trade-off in settings with up to a 25% mutation rate. We achieve this by sketching a subset of graph nodes and storing them in a K-nearest neighbor index. While sketches are more robust to indels, finding the nearest neighbor of a sketch in a high-dimensional space is more computationally challenging than finding exact seeds. We demonstrate that if we store sketch vectors in a K-nearest neighbor index, we can circumvent the curse of dimensionality. Our long sketch-based seed scheme contrasts existing approaches and highlights the important role that tensor sketching can play in bioinformatics applications. Our proposed seeding method and implementation have several advantages: i) We empirically show that our method is efficient and scales to graphs with 1 billion nodes, with time and memory requirements for preprocessing growing linearly with graph size and query time growing quasi-logarithmically with query length. ii) For queries with an edit distance of 25% relative to their length, on the 1 billion node graph, longer sketch-based seeds yield a 4x increase in recall compared to exact seeds. iii) Conceptually, our seeder can be incorporated into other aligners, proposing a novel direction for sequence-to-graph alignment. The implementation is available at: https://github.com/ratschlab/tensor-sketch-alignment.
