Sequence aligners can guarantee accuracy in almost O(m log n) time: a rigorous average-case analysis of the seed-chain-extend heuristic

Shaw, Jim; Yu, Yun William

doi:10.1101/2022.10.14.512303

Cited by 4 publications

(4 citation statements)

References 97 publications


“…Under the usual assumption of no repetitive k-mers [19], it is easy to estimate θ from k-mer matching statistics [7,19,4] between G and G′. We proved in a previous work that for random, mutating strings, the expected value of such spurious matches for a string of length n is (Theorem 2 in Shaw and Yu [20]), so this is not a bad assumption in practice when k is reasonably large and for simpler, non-eukaryotic genomes. However, when dealing with MAGs, we don’t have G and G′ but instead fragmented, contaminated, and incomplete versions of G and G′.…”

Section: Methods (mentioning)

confidence: 95%

Fast and robust metagenomic sequence comparison through sparse chaining with skani

Shaw

2023

Preprint

Self Cite


Sequence comparison algorithms for metagenome-assembled genomes (MAGs) often have difficulties dealing with data that is high-volume or low-quality. We present skani, a method for calculating average nucleotide identity (ANI) using sparse approximate alignments. skani is more accurate than FastANI for comparing incomplete, fragmented MAGs while also being > 20 times faster. For searching a database of > 65,000 prokaryotic genomes, skani takes only seconds per query and 5 GB of memory. skani is a versatile tool that unlocks higher-resolution insights for larger, noisier metagenomic data sets.



“…Most alignment tools use seed-chain-extend heuristic to compute the alignments quickly [53,54]. Given a set of seed matches (anchors) as input, co-linear chaining is a rigorous optimization technique to identify promising alignment regions in a reference.…”

Section: Methods For Haplotype-aware Chaining On Graphs (mentioning)

confidence: 99%

Haplotype-aware sequence alignment to pangenome graphs

Chandra, Gibney, Jain

2023

Preprint


Modern pangenome graphs are built using high-quality phased haplotype sequences such that each haplotype sequence corresponds to a path in the graph. Prioritizing the alignment of reads to these paths improves genotyping accuracy (Sirén et al., Science 2021). However, rigorous formulations for sequence-to-graph chaining and alignment do not consider the haplotype paths. As a result, the search space increases combinatorially as more variants are augmented in the graph. This limitation affects the effectiveness of the algorithms. In this paper, we propose novel formulations and provably good algorithms for haplotype-aware pattern matching of sequences to directed acyclic graphs (DAGs). Our work considers both sequence-to-DAG chaining and sequence-to-DAG alignment problems. Drawing inspiration from the commonly used models for genotype imputation, we assume that a query sequence is an imperfect mosaic of the reference haplotypes. Accordingly, our formulations extend previous chaining and alignment formulations by introducing a recombination penalty for a haplotype switch. First, we solve the haplotype-aware sequence-to-DAG alignment in O(|Q||E||ℋ|) time where Q is the query sequence, E is the set of edges, and ℋ is the set of haplotypes represented in the graph. Second, we prove that an algorithm significantly faster than O(|Q||E||ℋ|) is unlikely. Third, we propose a haplotype-aware chaining algorithm that uses O(|ℋ|N log |ℋ|N) time, where N is the count of exact matches. As a proof-of-concept, we implemented the chaining algorithm in the Minichain aligner (https://github.com/at-cg/minichain). Using simulated human major histocompatibility complex (MHC) query sequences and a pangenome graph of 60 publicly available MHC haplotypes, we show that the proposed algorithm offers a much better consistency between the ground-truth recombinations and the recombinations in the output chains when compared to a haplotype-agnostic algorithm.


“…There has also been recent advancement on the theoretical side. In [92], the authors show that long read mapping using the seed-chain-extend method is both fast and accurate with some guarantees on the average-case time complexity. We therefore believe this methodology will continue to be a popular approach in the domain of long read mapping.…”

Section: Future Directions (mentioning)

confidence: 99%

A survey of mapping algorithms in the long-reads era

et al. 2023


It has been over a decade since the first publication of a method dedicated entirely to mapping long-reads. The distinctive characteristics of long reads resulted in methods moving from the seed-and-extend framework used for short reads to a seed-and-chain framework due to the seed abundance in each read. The main novelties are based on alternative seed constructs or chaining formulations. Dozens of tools now exist, whose heuristics have evolved considerably. We provide an overview of the methods used in long-read mappers. Since they are driven by implementation-specific parameters, we develop an original visualization tool to understand the parameter settings (http://bcazaux.polytech-lille.net/Minimap2/).
