2023
DOI: 10.1101/2023.02.24.529942
Preprint

Themisto: a scalable colored k-mer index for sensitive pseudoalignment against hundreds of thousands of bacterial genomes

Abstract: Motivation: Huge data sets containing whole-genome sequences of bacterial strains are now commonplace and represent a rich and important resource for modern genomic epidemiology and metagenomics. In order to efficiently make use of these data sets, efficient indexing data structures - that are both scalable and provide rapid query throughput - are paramount. Results: Here, we present Themisto, a scalable colored k-mer index designed for large collections of microbial reference genomes, that works for both shor…

Cited by 17 publications (34 citation statements)
References 25 publications
“…SKA2 uses exact matching of k-mers, which has been the focus of intense optimisation efforts in bioinformatics, to create SNP alignments without explicitly performing any alignment. This is very similar to the pseudo-alignment approach which has become the dominant form of analysis of RNA-seq data (Bray et al 2016) and for representing population data (Holley and Melsted 2020; Alanko et al 2023; Břinda et al 2023). SKA2 has very similar benefits - speed, ease of use, robustness to structural variation, and reduced reference bias.…”
Section: Discussion (mentioning)
confidence: 87%
“…We evaluated the performances of together with eight state-of-the-art k-mer indexers: [2]; [7]; [18]; [17]; [22]; [13]; [12]; and [3]. The dataset for this benchmark is composed of metagenomic seawater sequencing data from 50 Tara Oceans samples, of 1.4 TB of gzipped fastq files.…”
Section: Results (mentioning)
confidence: 99%
“…k-mers are to genomics what words are to natural language: this way we can compare sequences by comparing their words. The number of k-mers existing in two sequences provides a metric to measure the similarity between them, leading to the so-called pseudo-alignment [2]. In order to efficiently perform pseudo-alignments between any queried sequence and a dataset, we index its k-mers.…”
Section: Methods (mentioning)
confidence: 99%
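
The shared k-mer metric described in the statement above can be illustrated with a minimal Python sketch. The function names (kmers, shared_kmer_count, jaccard) are hypothetical and chosen for illustration only. The sketch assumes plain string sequences and ignores reverse complements, and it is not taken from any of the cited tools.

```python
def kmers(seq, k):
    """Return the set of distinct length-k substrings (k-mers) of seq."""
    # A sequence shorter than k yields an empty set.
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def shared_kmer_count(a, b, k):
    """Count the distinct k-mers that occur in both sequences."""
    return len(kmers(a, k) & kmers(b, k))


def jaccard(a, b, k):
    """Shared k-mers divided by all distinct k-mers of the two sequences."""
    ka, kb = kmers(a, k), kmers(b, k)
    union = ka | kb
    return len(ka & kb) / len(union) if union else 0.0
```

With k = 3, the sequences "ACGTT" and "ACGAA" share only the k-mer ACG out of five distinct k-mers, so the shared count is 1 and the Jaccard similarity is 0.2.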
“…Due to the scale of databases to index, recognized tools often sacrifice precision for the sake of performance. This can be done through pseudo-alignment as defined in [2], breaking down the queried sequences into k-mers and comparing them against k-mers of the datasets, often organised in “colored de Bruijn graph” representation of as in Bifrost [12] or GGCAT [9]. Here, the graph construction is the main limitation of the methods.…”
Section: Introduction (mentioning)
confidence: 99%
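
As a toy illustration of the colored k-mer idea in the statement above, the following Python sketch maps each k-mer to the set of references ("colors") that contain it and pseudoaligns a query by intersecting those sets. This is a sketch under stated assumptions, not the implementation used by Themisto, Bifrost, or GGCAT: build_index and pseudoalign are hypothetical names, a plain dictionary stands in for the compressed de Bruijn graph, reverse complements are ignored, and skipping query k-mers that are absent from the index is one possible policy chosen here for simplicity.

```python
from collections import defaultdict


def build_index(references, k):
    """Map every k-mer to the set of reference names (colors) containing it.

    `references` is a dict {color_name: sequence}; a hypothetical input format.
    """
    index = defaultdict(set)
    for color, seq in references.items():
        for i in range(len(seq) - k + 1):
            index[seq[i:i + k]].add(color)
    return index


def pseudoalign(index, query, k):
    """Intersect the color sets of all query k-mers found in the index."""
    result = None
    for i in range(len(query) - k + 1):
        colors = index.get(query[i:i + k])
        if not colors:
            continue  # assumption: k-mers missing from the index are skipped
        result = set(colors) if result is None else result & colors
    return result or set()
```

For example, with references {"A": "ACGTAC", "B": "ACGTTT"} and k = 3, the query "ACGTA" pseudoaligns only to A, because its last k-mer GTA occurs only in A, while the query "ACGT" is compatible with both references.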