2021
DOI: 10.1093/bioinformatics/btab790
Theory of local k-mer selection with applications to long-read alignment

Abstract: Motivation: Selecting a subset of k-mers in a string in a local manner is a common task in bioinformatics tools for speeding up computation. Arguably the most well-known and common method is the minimizer technique, which selects the ‘lowest-ordered’ k-mer in a sliding window. Recently, it has been shown that minimizers may be a sub-optimal method for selecting subsets of k-mers when mutations are present. There is, however, a lack of understanding behind the theory of why certain methods perf…

Cited by 32 publications (34 citation statements)
References 37 publications
“…These new k-mers are called seeds instead of markers because we actually use them as seeds for k-mer matching and alignment. We note that while we could have used other "context-independent" k-mer seeding methods that are more "conserved" than FracMinHash [28], we found that FracMinHash works well enough for relatively sparse seeds when c ≫ k. By default, k = 15 and c = 125. We note that the small value of k = 15 used by default leads to too many repetitive anchors on larger genomes, so we mask the top seeds that occur more than 2500/c times by default.…”
Section: Obtaining Sparse Seeds For Chaining
confidence: 88%
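The FracMinHash seeding described in the quote above can be sketched in a few lines: a k-mer is kept iff its hash falls in the lowest 1/c fraction of the hash range, so the decision is "context-independent" (it depends only on the k-mer, not on neighboring k-mers) and the expected density is roughly 1/c. This is an illustrative sketch, not the cited tool's implementation; the hash function and the `fracminhash_seeds` name are choices made here for demonstration.

```python
import hashlib


def fracminhash_seeds(seq, k=15, c=125):
    # FracMinHash selection: keep a k-mer iff its 64-bit hash falls in
    # the lowest 1/c fraction of the hash range. The decision depends
    # only on the k-mer itself, so identical k-mers are always either
    # all kept or all dropped (context independence). Defaults mirror
    # the quote's k = 15, c = 125; the quoted method additionally masks
    # seeds occurring more than 2500/c times, which is omitted here.
    threshold = (1 << 64) // c
    seeds = []
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        h = int.from_bytes(
            hashlib.blake2b(kmer.encode(), digest_size=8).digest(), "big"
        )
        if h < threshold:
            seeds.append((i, kmer))
    return seeds
```

Because selection is per-k-mer, two sequences sharing a k-mer always agree on whether it is a seed, which is what makes these seeds usable as anchors for matching and chaining.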
“…These new k-mers are called seeds instead of markers because we actually use them as seeds for k-mer matching and alignment. We note that while we could have used other “context-independent” k-mer seeding methods that are more “conserved” than FracMinHash [28], we found that FracMinHash works well enough for relatively sparse seeds when c ≫ k . By default, k = 15 and c = 125.…”
Section: Methods
confidence: 99%
“…The original open syncmer definition in (Edgar, 2021) had a parameter t where a k-mer was selected if the smallest s-mer was in the t-th position; we proved in (Shaw and Yu, 2022) that the optimal t is ⌈(k−s+1)/2⌉ with respect to maximizing the number of conserved bases from k-mer matching. The reason we choose open syncmers is primarily due to the following fact, which was shown in (Edgar, 2021): Theorem 7 follows by examining the smallest s-mer in a k-mer and noticing that in the next overlapping k-mer, the locations for the new smallest s-mer are restricted.…”
Section: Sketching and Local K-mer Selection
confidence: 91%
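The open-syncmer rule quoted above admits a short illustration: a k-mer is an open syncmer iff its smallest s-mer starts at offset t, and exactly one offset "wins" per k-mer, which is why the density is 1/(k−s+1). This sketch uses lexicographic order as a stand-in for the random s-mer ordering; with 1-based positions the optimal t is ⌈(k−s+1)/2⌉, i.e. ⌈(k−s+1)/2⌉ − 1 in the 0-based indexing below.

```python
def is_open_syncmer(kmer, s, t):
    # Open syncmer test: the k-mer is selected iff its smallest s-mer
    # (lexicographic order here, standing in for a random hash order;
    # ties broken by leftmost position) starts at 0-based offset t.
    smers = [kmer[i:i + s] for i in range(len(kmer) - s + 1)]
    return smers.index(min(smers)) == t
```

Since each k-mer has exactly one winning offset among its k−s+1 s-mer positions, summing the indicator over all t gives 1, so on random sequences the fraction of selected k-mers is 1/(k−s+1), matching the quoted identity c = k−s+1.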
“…) which we will ignore; see (Zheng et al, 2020) or (Shaw and Yu, 2022)). We will let c be the reciprocal of the density, so c = (k − s + 1).…”
Section: Sketching and Local K-mer Selection
confidence: 99%
“…The main way of analyzing genetic sequences is by comparing them to each other. For large data, this is usually done via ‘seeds’, by which we mean simple similarities that can be found quickly ( Shaw and Yu, 2022 ). The simplest seeds are fixed-length exact matches, but they can also be inexact ( Altschul et al , 1990 ; Ma et al , 2002 ; Noé and Kucherov, 2004 ; Sahlin, 2021 ) and/or variable length ( Csűrös, 2004 ).…”
Section: Introduction
confidence: 99%