2021
DOI: 10.1186/s13015-021-00192-7

Disk compression of k-mer sets

Abstract: K-mer based methods have become prevalent in many areas of bioinformatics. In applications such as database search, they often work with large multi-terabyte-sized datasets. Storing such large datasets is a detriment to tool developers, tool users, and reproducibility efforts. General purpose compressors like gzip, or those designed for read data, are sub-optimal because they do not take into account the specific redundancy pattern in k-mer sets. In our earlier work (Rahman and Medvedev, RECOMB 2020), we prese…

Cited by 17 publications (19 citation statements)
References 36 publications
“…It may even be possible to compute all four Jaccard indices without actually replacing letters by defining hash functions that do not distinguish letters. Finally, NSB may be able to use compressed k-mer sets (Rahman et al., 2021) to reduce its storage while keeping the same accuracy. We leave the exploration of these avenues to further work.…”
Section: Discussion (mentioning)
confidence: 99%
“…Previous papers used these concepts somewhat informally; when definitions were given, they worked in the context of that paper but failed to have more general desired properties. For example, our previous work had an inconsistency in the way that a walk was defined on a single vertex versus on many vertices [28]. One key takeaway is that, as a rule of thumb, when working with bidirected graphs one should avoid thinking in terms of vertices but think instead of vertex-sides.…”
Section: Discussion (mentioning)
confidence: 99%
“…However, such methods do not provide guarantees on the accuracy of their approximations that are simultaneously valid for all (or the most frequent) k-mers. In recent years other problems closely related to the task of counting k-mers have been studied, including how to efficiently index [38,15,30,28], represent [7,10,1,14,14,29,17,44], query [53,54,60,55,5,27], and store [18,35,16,43] the massive collections of sequences or of k-mers that are extracted from the data. A natural approach to reduce computational demands is to analyze a small sample instead of the entire dataset.…”
Section: Related Work (mentioning)
confidence: 99%