2022
DOI: 10.1093/bioadv/vbac029

kmtricks: efficient and flexible construction of Bloom filters for large sequencing data collections

Abstract: When indexing large collections of short-read sequencing data, a common operation that has now been implemented in several tools (Sequence Bloom Trees and variants, BIGSI, etc.) is to construct a collection of Bloom filters, one per sample. Each Bloom filter is used to represent a set of k-mers which approximates the desired set of all the non-erroneous k-mers present in the sample. However, this approximation is imperfect, especially in the case of metagenomics data. Erroneous but abundant k-mers are wrongly incl…
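The per-sample construction described in the abstract can be illustrated with a minimal sketch. This is not the kmtricks implementation: the `BloomFilter` class, the `build_sample_filter` helper, the BLAKE2-based hashing, and the single `min_abundance` cutoff (a common heuristic for discarding likely erroneous k-mers) are all assumptions made for illustration.

```python
import hashlib
from collections import Counter


def kmers(seq, k):
    """Yield every k-mer of a read (no canonicalization, for brevity)."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


class BloomFilter:
    """Toy Bloom filter backed by a bytearray (hypothetical helper)."""

    def __init__(self, n_bits, n_hashes):
        self.n_bits = n_bits
        self.n_hashes = n_hashes
        self.bits = bytearray((n_bits + 7) // 8)

    def _positions(self, item):
        # Derive n_hashes bit positions by salting BLAKE2b with the hash index.
        for seed in range(self.n_hashes):
            digest = hashlib.blake2b(
                item.encode(), digest_size=8, salt=seed.to_bytes(8, "little")
            ).digest()
            yield int.from_bytes(digest, "little") % self.n_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        # True may be a false positive; False is always correct.
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item)
        )


def build_sample_filter(reads, k, min_abundance, n_bits=1 << 16, n_hashes=3):
    """Count k-mers in one sample and insert those seen >= min_abundance times.

    The abundance cutoff is an assumed stand-in for "non-erroneous" k-mers.
    """
    counts = Counter(km for read in reads for km in kmers(read, k))
    bf = BloomFilter(n_bits, n_hashes)
    for km, count in counts.items():
        if count >= min_abundance:
            bf.add(km)
    return bf
```

A collection is then one such filter per sample, for example `{name: build_sample_filter(reads, k=31, min_abundance=2) for name, reads in samples.items()}`, where `samples` is a hypothetical mapping from sample name to its reads.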

citations
Cited by 34 publications
(29 citation statements)
references
References 31 publications
“…In our study, we have shown that it is not necessary to store the exact occurrences of all representative k-mers, it is enough to map them to some levels. Thus, all recent research (Bingmann et al., 2019; Harris and Medvedev, 2020; Kitaya and Shibuya, 2021; Lemane et al., 2021; Marchet et al., 2020; Pandey et al., 2018; Seiler et al., 2021; Solomon and Kingsford, 2016, 2018; Sun et al., 2018; Yu et al., 2018) for finding a fast and space efficient data structure to answer in which experiments a transcript is present, can be easily adapted to a tool estimating expression values by storing this data structure multiple times. However, as we demonstrated in our previous study (Seiler et al., 2021), the most efficient data structure at the moment is the IBF.…”
Section: Discussion | Citation type: mentioning
confidence: 99%
“…In the last few years, several tools for indexing a large amount of sequencing data were developed. These tools are based on the analysis of the underlying set of k-mers (Bingmann et al., 2019; Harris and Medvedev, 2020; Kitaya and Shibuya, 2021; Lemane et al., 2021; Marchet et al., 2020; Pandey et al., 2018; Seiler et al., 2021; Solomon and Kingsford, 2016, 2018; Sun et al., 2018; Yu et al., 2018). The main idea is to store the k-mers of a representative subset [e.g.…”
Section: Introduction | Citation type: mentioning
confidence: 99%
“…The colored unitigs of the input database could also potentially serve as a common data exchange format between different colored de Bruijn graph tools, enabling more efficient interoperability between tools. In this work, we have focused on exact k-mer indexing, but we think that recent inexact methods based on the Sequence Bloom Tree (Lemane et al., 2022) could potentially be used to approximate Themisto pseudoalignment to a satisfactory precision. It is unclear however how such a method would scale to datasets with a large amount of colors.…”
Section: Discussion | Citation type: mentioning
confidence: 99%
“…In this work, we have focused on exact k-mer indexing, but we think that recent inexact methods based on the Sequence Bloom Tree (Lemane et al, 2022) could potentially be used to approximate Themisto pseudoalignment to a satisfactory precision. It is unclear however how such a method would scale to datasets with a large amount of colors.…”
Section: Discussion | Citation type: mentioning
confidence: 99%
“…Then, we used Kmtricks to build a k-mer count matrix for each cohort, totaling five matrices (three AML, one healthy, and one mature and immature blasts cells). Kmtricks (Lemane et al., 2022) is a tool to count k-mers efficiently in large datasets and produce a k-mer count matrix across multiple samples. For example, the Beat-AML cohort has 7 terabytes of fastq files, Kmtricks reduced to a k-mer count compressed matrix of 78 gigabytes.…”
Section: Methods | Citation type: mentioning
confidence: 99%
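The "k-mer count matrix across multiple samples" mentioned in this statement can be sketched in a few lines. This is only a conceptual, in-memory illustration, not how kmtricks computes or stores its matrices; the function names `count_kmers` and `kmer_count_matrix` and the dict-of-lists layout are assumptions.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every k-mer occurring in the reads of one sample."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def kmer_count_matrix(samples, k):
    """Build a k-mer x sample count matrix.

    `samples` maps a sample name to its list of reads (hypothetical layout).
    Returns the ordered sample names (columns) and a dict mapping each k-mer
    (row) to its counts, one entry per sample, with 0 where it is absent.
    """
    names = sorted(samples)
    per_sample = {name: count_kmers(samples[name], k) for name in names}
    all_kmers = sorted(set().union(*per_sample.values()))
    matrix = {km: [per_sample[name][km] for name in names] for km in all_kmers}
    return names, matrix
```

Each row holds the abundance of one k-mer in every sample, so rows can be filtered or compared across a cohort without returning to the raw reads.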