2023
DOI: 10.1093/bioinformatics/btad057

How to optimally sample a sequence for rapid analysis

Abstract: Motivation: We face an increasing flood of genetic sequence data, from diverse sources, requiring rapid computational analysis. Rapid analysis can be achieved by sampling a subset of positions in each sequence. Previous sequence-sampling methods, such as minimizers, syncmers, and minimally-overlapping words, were developed by heuristic intuition, and are not optimal. Results: We present a sequence-sampling approach that provabl…
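
To make "sampling a subset of positions" concrete, here is a minimal sketch of classic window minimizers, one of the earlier heuristic methods the abstract names. It is not the approach this paper presents. The function name `window_minimizers`, the parameters `k` (k-mer length) and `w` (window size), and the use of plain lexicographic order are all illustrative assumptions; practical tools typically rank k-mers with a hash function instead.

```python
def window_minimizers(seq: str, k: int, w: int) -> list[int]:
    """Return sorted start positions of k-mers sampled as (w, k)-minimizers.

    In every window of w consecutive k-mers, the lexicographically smallest
    k-mer is sampled (leftmost on ties). Lexicographic order is an assumption
    made for readability, not a recommendation.
    """
    if k <= 0 or w <= 0:
        raise ValueError("k and w must be positive")

    n_kmers = len(seq) - k + 1  # number of k-mers in the sequence
    selected = set()

    # Slide a window over the k-mer start positions.
    for start in range(n_kmers - w + 1):
        window = range(start, start + w)
        # Pick the smallest k-mer; the position breaks ties toward the left.
        best = min(window, key=lambda i: (seq[i:i + k], i))
        selected.add(best)

    return sorted(selected)


if __name__ == "__main__":
    print(window_minimizers("ACGTACGTAC", k=3, w=3))
```

Only the sampled positions would then be indexed or compared, which is where the speed-up comes from; sequences too short to hold a full window yield an empty list in this sketch.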

citations
Cited by 5 publications
(3 citation statements)
references
References 36 publications
“…This theorem is the reason we use open syncmers and is crucial to our proofs. The spacing property makes selected open syncmers a polar set (Zheng et al 2021); other methods also give rise to polar sets (Frith et al 2021, 2023) but open syncmers seem to perform well empirically (Dutta et al 2022; Shaw and Yu 2022; Frith et al 2023) and are easy to describe. For the rest of the section, we will assume c = k − s + 1 is odd, so .…”
Section: Methods (mentioning)
confidence: 99%
“…This theorem is the reason we use open syncmers and is crucial to our proofs. The spacing property makes selected open syncmers a polar set (Zheng et al, 2021); other methods also give rise to polar sets (Frith et al, 2020, 2022) but open syncmers seem to perform well empirically (Shaw and Yu, 2022; Frith et al, 2022; Dutta et al, 2022) and are easy to describe. For the rest of the section, we will assume c = k – s + 1 is odd, so .…”
Section: Methods (mentioning)
confidence: 99%
“…These seeding constructs have been referred to as dynamic seeds (Sahlin, Baudeau, et al 2022) as they are neither fixed in length nor in the number of CPU cycles for their construction. There are also seeding constructs known as subsampling methods that aim to use only a subsample of k-mers as seeds due to their redundant nature using, e.g., minimizers (Roberts et al 2004) or later subsampling techniques (DeBlasio et al 2019; Frith, Noé, et al 2020; Ekim, Berger, and Orenstein 2020; Zheng et al 2021; Edgar 2021; Frith, Shaw, et al 2023). For an extensive study of subsampling techniques, see (Shaw and Yu 2021).…”
Section: Other Seed Constructs (mentioning)
confidence: 99%