Co-linear Chaining with Overlaps and Gap Costs

Jain, Chirag; Gibney, Daniel; Thankachan, Sharma V.

doi:10.1007/978-3-031-04749-7_15

Cited by 13 publications

(11 citation statements)

References 32 publications

“…We only use k-mer seeds in this study, although other types of seeds are possible (Keich et al, 2004; Kiełbasa et al, 2011). An optimal increasing subsequence of possibly overlapping anchors based on some score is then collected into a chain, where increasing is defined with the standard precedence relationship (Jain et al, 2022) between k-mer anchors (See Figure 5a and Chaining below). The chain is extended into a full alignment by aligning between anchor gaps in the chain.…”

Section: Assumptions and Models (mentioning)

confidence: 99%

“…Extension and chaining runtimes. Given sorted anchors, let T_Chain be the time spent finding an optimal chain. T_Chain depends on the objective function (Mäkinen and Sahlin, 2020; Jain et al, 2022; Abouelhoda and Ohlebusch, 2005; Otto et al, 2011). Since our gap costs are linear, T_Chain = O(N log N) where N is the number of anchors (Abouelhoda and Ohlebusch, 2005).…”

Section: Assumptions and Models (mentioning)

confidence: 99%
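The two statements above describe chaining as picking an optimal increasing subsequence of possibly overlapping anchors under a gap-penalized score. A minimal sketch of that idea follows. It is a plain quadratic dynamic program, not the O(N log N) method the statement cites, and the anchor layout `(query_start, ref_start, length)`, the unit score per anchor, and the diagonal-difference gap cost are all assumptions chosen for illustration rather than the cited papers' definitions.

```python
from typing import List, Tuple

# Hypothetical anchor layout: (query_start, ref_start, length) of an exact k-mer match.
Anchor = Tuple[int, int, int]


def precedes(a: Anchor, b: Anchor) -> bool:
    """True if a starts strictly before b on both sequences (overlap is allowed)."""
    return a[0] < b[0] and a[1] < b[1]


def gap_cost(a: Anchor, b: Anchor, penalty: float) -> float:
    """Assumed linear gap cost: penalty times the diagonal shift between a and b."""
    return penalty * abs((b[0] - a[0]) - (b[1] - a[1]))


def best_chain(anchors: List[Anchor], penalty: float = 0.5) -> Tuple[float, List[Anchor]]:
    """Quadratic-time DP: each anchor scores 1, each link pays gap_cost."""
    anchors = sorted(anchors)  # sorting by query_start keeps predecessors earlier
    n = len(anchors)
    if n == 0:
        return 0.0, []
    score = [1.0] * n          # best chain score ending at each anchor
    prev = [-1] * n            # back-pointer to the previous anchor in that chain
    for j in range(n):
        for i in range(j):
            if precedes(anchors[i], anchors[j]):
                cand = score[i] + 1.0 - gap_cost(anchors[i], anchors[j], penalty)
                if cand > score[j]:
                    score[j], prev[j] = cand, i
    end = max(range(n), key=score.__getitem__)
    best = score[end]
    chain = []
    while end != -1:           # walk back-pointers to recover the chain
        chain.append(anchors[end])
        end = prev[end]
    return best, chain[::-1]


if __name__ == "__main__":
    demo = [(0, 0, 3), (2, 2, 3), (4, 10, 3), (5, 5, 3)]
    print(best_chain(demo))
```

In the demo input, the first two anchors overlap yet still chain, while the off-diagonal anchor `(4, 10, 3)` is left out because linking it would cost more than it scores.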

Sequence aligners can guarantee accuracy in almost O(m log n) time: a rigorous average-case analysis of the seed-chain-extend heuristic

Shaw

2022

Preprint

Seed-chain-extend with k-mer seeds is a powerful heuristic technique for sequence alignment employed by modern sequence aligners. While effective in practice for both runtime and accuracy, theoretical guarantees on the resulting alignment do not exist for seed-chain-extend. In this work, we give the first rigorous bounds for the efficacy of seed-chain-extend with k-mers in expectation. Assume we are given a random nucleotide sequence of length ≈ n that is indexed (or seeded) and a mutated substring of length ≈ m ≤ n with mutation rate θ < 0.206. We prove that we can find a k = Θ(log n) for the k-mer size such that the expected runtime of seed-chain-extend under optimal linear gap cost chaining and quadratic time gap extension is O(m n^{f(θ)} log n), where f(θ) < 2.43·θ holds as a loose bound. In fact, for reasonable θ = 0.05, f(θ) < 0.08, indicating nearly quasilinear running time in practice. The alignment also turns out to be good; we prove that more than 1 − O(1/√m) fraction of the homologous bases are recoverable under an optimal chain. We also show that our bounds work when k-mers are sketched, i.e. only a subset of all k-mers is selected. Under the open syncmer sketching method, one can sketch with decreasing density as a function of n and achieve asymptotically smaller chaining time, yet the same bounds for extension time and recoverability hold. In other words, sketching reduces chaining time without increasing alignment time or decreasing accuracy too much, justifying the effectiveness of sketching as a practical speedup in sequence alignment. We verify our results in simulation and conjecture that f(θ) can be further reduced.
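To see why the abstract calls this runtime nearly quasilinear, it helps to plug numbers into the bound, reading it as O(m n^{f(θ)} log n). The reference length n = 10^9 below is an illustrative assumption, not a figure from the abstract; θ = 0.05 and f(θ) < 0.08 are the abstract's own values.

```latex
n^{f(\theta)} \;<\; n^{0.08} \;=\; \left(10^{9}\right)^{0.08} \;=\; 10^{0.72} \;\approx\; 5.2
```

Under that assumption the extra factor over m log n stays in the single digits even for a reference of about a billion bases.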

“…Namely, we obtain an O(m + n + k^2|V| + |E| + kN log N) time algorithm for computing a longest common subsequence (LCS) between a query string Q and a path of G, where m = |Q|, n is the total length of node labels, k is the width (minimum number of paths covering the nodes) of G, and N is the number of maximal exact matches (MEMs) between Q and the node labels (node MEMs). For the case with two strings as input, a recent formulation of co-linear chaining [19] captures unit cost edit distance. There has been an attempt to extend the results to graphs considering gap costs [6], but it appears difficult to make such formulation fully symmetric (due to there being exponential many paths between two anchors).…”

Section: Introduction (mentioning)

confidence: 99%

Chaining of Maximal Exact Matches in Graphs

Rizzo

Cáceres

Mäkinen

2023

Preprint

We study the problem of finding maximal exact matches (MEMs) between a query string Q and a labeled directed acyclic graph (DAG) G = (V, E, ℓ) and subsequently co-linearly chaining these matches. We show that it suffices to compute MEMs between node labels and Q (node MEMs) to encode full MEMs. Node MEMs can be computed in linear time and we show how to co-linearly chain them to solve the Longest Common Subsequence (LCS) problem between Q and G. Our chaining algorithm is the first to consider a symmetric formulation of the chaining problem in graphs and runs in O(k^2|V| + |E| + kN log N) time, where k is the width (minimum number of paths covering the nodes) of G, and N is the number of node MEMs. We then consider the problem of finding MEMs when the input graph is an indexable elastic founder graph (subclass of labeled DAGs studied by Equi et al., Algorithmica 2022). For arbitrary input graphs, the problem cannot be solved in truly sub-quadratic time under SETH (Equi et al., ICALP 2019). We show that we can report all MEMs between Q and an indexable elastic founder graph in time O(nH^2 + m + M_κ), where n is the total length of node labels, H is the maximum number of nodes in a block of the graph, m = |Q|, and M_κ is the number of MEMs of length at least κ. The results extend to the indexing problem, where the graph is preprocessed and a set of queries is processed as a batch.

“…Co-linear chaining is a mathematically rigorous approach to do clustering of anchors. It is well studied for the case of sequence-to-sequence alignment [1,11,12,16,30,34,43], and is widely used in present-day long read to reference sequence aligners [18,23,38,40].…”

Section: Introduction (mentioning)

confidence: 99%

“…However, the problem formulations in these works did not include gap cost. Without penalizing gaps, co-linear chaining is less effective [16]. A challenge in enforcing gap cost is that measuring gap between two loci in a DAG is not a constant-time operation like in a sequence.…”

Section: Introduction (mentioning)

confidence: 99%
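The last statement contrasts gap measurement in a sequence with gap measurement in a DAG. The sketch below illustrates the difference under simplifying assumptions: the DAG is a hypothetical adjacency dictionary, and the gap is taken as the number of intermediate nodes on a shortest path. The cited work may define gaps differently (for example, via a path cover), so this only shows why one case is a subtraction and the other needs a traversal.

```python
from collections import deque
from typing import Dict, List, Optional


def sequence_gap(end_of_first: int, start_of_second: int) -> int:
    """Gap between two loci on a sequence: a constant-time subtraction."""
    return max(0, start_of_second - end_of_first - 1)


def dag_gap(adj: Dict[str, List[str]], src: str, dst: str) -> Optional[int]:
    """Intermediate nodes on a shortest src -> dst path, or None if unreachable.

    Uses breadth-first search, so the cost grows with the part of the graph
    explored instead of being constant.
    """
    if src == dst:
        return 0
    dist = {src: 0}
    queue = deque([src])
    while queue:
        u = queue.popleft()
        for v in adj.get(u, []):
            if v not in dist:
                dist[v] = dist[u] + 1
                if v == dst:
                    return dist[v] - 1  # exclude dst itself from the gap
                queue.append(v)
    return None


if __name__ == "__main__":
    # Hypothetical DAG with a bubble: a -> {b, c} -> d -> e
    graph = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": ["e"], "e": []}
    print(sequence_gap(4, 10))
    print(dag_gap(graph, "a", "e"))
    print(dag_gap(graph, "e", "a"))
```

A chaining algorithm that evaluated gaps this way for every anchor pair would pay for one traversal per pair, which is the kind of overhead the statement alludes to.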

Sequence to graph alignment using gap-sensitive co-linear chaining

Chandra

Jain

2022

Preprint

Self Cite

Co-linear chaining is a widely used technique in sequence alignment tools that follow seed-filter-extend methodology. It is a mathematically rigorous approach to combine small exact matches. For co-linear chaining between two sequences, efficient subquadratic-time chaining algorithms are well-known for linear, concave and convex gap cost functions [Eppstein et al. JACM'92]. However, developing extensions of chaining algorithms for DAGs (directed acyclic graphs) has been challenging. Recently, a new sparse dynamic programming framework was introduced that exploits small path cover of pangenome reference DAGs, and enables efficient chaining [Makinen et al. TALG'19, RECOMB'18]. However, the underlying problem formulation did not consider gap cost which makes chaining less effective in practice. To address this, we develop novel problem formulations and optimal chaining algorithms that support a variety of gap cost functions. We demonstrate empirically the ability of our provably-good chaining implementation to align long reads more precisely in comparison to existing aligners. For mapping simulated long reads from human genome to a pangenome DAG of 95 human haplotypes, we achieve 98.7% precision while leaving < 2% reads unmapped. Implementation: https://github.com/at-cg/minichain
