2022
DOI: 10.1101/2022.10.14.512303
Preprint

Sequence aligners can guarantee accuracy in almost O(m log n) time: a rigorous average-case analysis of the seed-chain-extend heuristic

Abstract: Seed-chain-extend with k-mer seeds is a powerful heuristic technique for sequence alignment employed by modern sequence aligners. While effective in practice for both runtime and accuracy, theoretical guarantees on the resulting alignment do not exist for seed-chain-extend. In this work, we give the first rigorous bounds for the efficacy of seed-chain-extend with k-mers in expectation. Assume we are given a random nucleotide sequence of length ≈ n that is indexed (or seeded) and a mutated substring of length ≈…

Cited by 4 publications (4 citation statements)
References 97 publications
“…Under the usual assumption of no repetitive k-mers [19], it is easy to estimate θ from k-mer matching statistics [7,19,4] between G and G′. We proved in a previous work that for random, mutating strings, the expected value of such spurious matches for a string of length n is (Theorem 2 in Shaw and Yu [20]), so this is not a bad assumption in practice when k is reasonably large and for simpler, non-eukaryotic genomes. However, when dealing with MAGs, we don’t have G and G′ but instead fragmented, contaminated, and incomplete versions of G and G′.…”
Section: Methods (mentioning)
confidence: 95%
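As a rough illustration of the k-mer matching idea in the statement above, the Python sketch below estimates θ from the fraction of k-mers of G that also occur in G′. It assumes a simple model that the quoted text does not spell out: each base mutates independently with probability θ and no k-mer repeats, so a k-mer survives unchanged with probability (1 − θ)^k. The function name `estimate_theta` and the toy sequences are hypothetical and are not taken from the cited work.

```python
def estimate_theta(g: str, g_prime: str, k: int) -> float:
    """Estimate a per-base mutation rate from shared k-mers.

    Assumption (not from the source): under independent per-base mutation
    with rate theta and no repeated k-mers, the expected fraction of k-mers
    of g that survive in g_prime is (1 - theta) ** k.
    """
    if k <= 0 or len(g) < k:
        raise ValueError("k must be positive and no longer than g")

    # All k-mers of g, in order (duplicates kept so the fraction is per position).
    g_kmers = [g[i:i + k] for i in range(len(g) - k + 1)]
    # Set of k-mers present in g_prime, for O(1) membership tests.
    g_prime_kmers = {g_prime[i:i + k] for i in range(len(g_prime) - k + 1)}

    shared = sum(1 for kmer in g_kmers if kmer in g_prime_kmers)
    fraction = shared / len(g_kmers)

    # Invert fraction = (1 - theta) ** k.
    return 1.0 - fraction ** (1.0 / k)


if __name__ == "__main__":
    g = "ACGTTGCAAG"
    g_prime = "ACGTAGCAAG"  # one substitution relative to g
    print(estimate_theta(g, g_prime, k=3))
```

With fragmented or contaminated inputs, as the statement notes for MAGs, the shared fraction no longer reflects mutation alone, so this naive inversion would be biased.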
“…Most alignment tools use seed-chain-extend heuristic to compute the alignments quickly [53,54]. Given a set of seed matches (anchors) as input, co-linear chaining is a rigorous optimization technique to identify promising alignment regions in a reference.…”
Section: Methods For Haplotype-aware Chaining On Graphs (mentioning)
confidence: 99%
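The chaining step mentioned in the statement above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the method of any particular aligner: anchors are (reference position, query position) pairs, a chain must increase strictly in both coordinates, and the score is simply the number of anchors in the chain. Practical tools typically add gap penalties and use faster than quadratic algorithms. The function name `chain_anchors` is hypothetical.

```python
from typing import List, Tuple

Anchor = Tuple[int, int]  # (reference position, query position)


def chain_anchors(anchors: List[Anchor]) -> List[Anchor]:
    """Return a longest co-linear chain of anchors using an O(N^2) dynamic program.

    Assumption: the chain score is the anchor count, with no gap cost.
    """
    if not anchors:
        return []

    pts = sorted(set(anchors))  # sort by reference position, then query position
    n = len(pts)
    best = [1] * n   # best[j]: size of the best chain ending at pts[j]
    prev = [-1] * n  # prev[j]: predecessor of pts[j] in that chain, or -1

    for j in range(n):
        for i in range(j):
            colinear = pts[i][0] < pts[j][0] and pts[i][1] < pts[j][1]
            if colinear and best[i] + 1 > best[j]:
                best[j] = best[i] + 1
                prev[j] = i

    # Backtrack from the anchor that ends the highest-scoring chain.
    end = max(range(n), key=best.__getitem__)
    chain = []
    while end != -1:
        chain.append(pts[end])
        end = prev[end]
    chain.reverse()
    return chain


if __name__ == "__main__":
    seeds = [(0, 0), (5, 5), (3, 40), (10, 10), (50, 2)]
    print(chain_anchors(seeds))
```

In this toy input the off-diagonal anchors (3, 40) and (50, 2) stand in for spurious seed matches; the chain keeps only the anchors that are consistent with one another in both sequences.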
“…There has also been recent advancement on the theoretical side. In [92], the authors show that long read mapping using the seed-chain-extend method is both fast and accurate with some guarantees on the average-case time complexity. We therefore believe this methodology will continue to be a popular approach in the domain of long read mapping.…”
Section: Future Directions (mentioning)
confidence: 99%