Disk compression of k-mer sets

Rahman, Amatur; Chikhi, Rayan; Medvedev, Paul

doi:10.1186/s13015-021-00192-7

Cited by 17 publications

(19 citation statements)

References 36 publications


“…It may even be possible to compute all four Jaccard indices without actually replacing letters by defining hash functions that do not distinguish letters. Finally, NSB may be able to use compressed k-mer sets (Rahman et al., 2021) to reduce its storage while keeping the same accuracy. We leave the exploration of these avenues to further work.…”

Section: Discussion (mentioning)

confidence: 99%

Genome-wide alignment-free phylogenetic distance estimation under a no strand-bias model

Balaban, Bristy, Faisal, et al. 2022

Bioinformatics Advances


While alignment has been the dominant approach for determining homology prior to phylogenetic inference, alignment-free methods can simplify the analysis, especially when analyzing genome-wide data. Furthermore, alignment-free methods present the only option for emerging forms of data, such as genome skims, which do not permit assembly. Despite the appeal, alignment-free methods have not been competitive with alignment-based methods in terms of accuracy. One limitation of alignment-free methods is their reliance on simplified models of sequence evolution such as Jukes-Cantor. If we can estimate frequencies of base substitutions in an alignment-free setting, we can compute pairwise distances under more complex models. However, since the strand of DNA sequences is unknown for many forms of genome-wide data, which arguably present the best use case for alignment-free methods, the most complex models that one can use are the so-called no strand-bias models. We show how to calculate distances under a four-parameter no strand-bias model called TK4 without relying on alignments or assemblies. The main idea is to replace letters in the input sequences and recompute Jaccard indices between k-mer sets. However, on larger genomes, we also need to compute the number of k-mer mismatches after replacement due to random chance as opposed to homology. We show in simulation that alignment-free distances can be highly accurate when genomes evolve under the assumed models and study the accuracy on assembled and unassembled biological data. Our software is available open-source at https://github.com/nishatbristy007/NSB.
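The core operation named in this abstract, replacing letters and then recomputing a Jaccard index between k-mer sets, can be illustrated with a minimal Python sketch. The function names and the example mapping (merging G into A) are hypothetical choices made for illustration; they are not taken from the NSB software, and the sketch does not handle reverse complements or the chance-mismatch correction the abstract mentions.

```python
def kmer_set(seq: str, k: int) -> set[str]:
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard index |A & B| / |A | B|; taken to be 1.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def replace_letters(seq: str, mapping: dict[str, str]) -> str:
    """Rewrite seq, substituting each letter found in mapping (hypothetical helper)."""
    return seq.translate(str.maketrans(mapping))


def jaccard_after_replacement(x: str, y: str, k: int, mapping: dict[str, str]) -> float:
    """Jaccard index between the k-mer sets of x and y after letter replacement."""
    return jaccard(
        kmer_set(replace_letters(x, mapping), k),
        kmer_set(replace_letters(y, mapping), k),
    )


if __name__ == "__main__":
    x, y = "ACGT", "ACAT"
    print(jaccard(kmer_set(x, 2), kmer_set(y, 2)))
    # Assumed example mapping: treat G as A, so the G/A difference disappears.
    print(jaccard_after_replacement(x, y, 2, {"G": "A"}))
```

Comparing the index before and after a replacement shows how much of the dissimilarity between two sequences is due to the merged letters.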

“…Previous papers used these concepts somewhat informally; when definitions were given, they worked in the context of that paper but failed to have more general desired properties. For example, our previous work had an inconsistency in the way that a walk was defined on a single vertex versus on many vertices [28]. One key takeaway is that, as a rule of thumb, when working with bidirected graphs one should avoid thinking in terms of vertices but think instead of vertex-sides.…”

Section: Discussion (mentioning)

confidence: 99%

Assembler artifacts include misassembly because of unsafe unitigs and underassembly because of bidirected graphs

Rahman, Medvedev 2022

Genome Res.

Self Cite


Recent assemblies by the T2T and VGP consortia have achieved significant accuracy but required a tremendous amount of effort and resources. More typical assembly efforts, on the other hand, still suffer both from misassemblies (joining sequences that should not be adjacent) and from under-assemblies (not joining sequences that should be adjacent). To better understand the common algorithm-driven causes of these limitations, we investigated the unitig algorithm, which is a core algorithm at the heart of most assemblers. We prove that, contrary to popular belief, even when there are no sequencing errors, unitigs are not always safe (i.e. they are not guaranteed to be substrings of the sequenced genome). We also prove that the unitigs of a bidirected de Bruijn graph are different from those of a doubled de Bruijn graph and, contrary to our expectations, result in under-assembly. Using experimental simulations, we then confirm that these two artifacts exist not only in theory but also in the output of widely used assemblers. To the best of our knowledge, this paper is the first to theoretically predict the existence of these assembler artifacts and confirm and measure the extent of their occurrence in practice.


“…However, such methods do not provide guarantees on the accuracy of their approximations that are simultaneously valid for all (or the most frequent) k-mers. In recent years other problems closely related to the task of counting k-mers have been studied, including how to efficiently index [38,15,30,28], represent [7,10,1,14,14,29,17,44], query [53,54,60,55,5,27], and store [18,35,16,43] the massive collections of sequences or of k-mers that are extracted from the data. A natural approach to reduce computational demands is to analyze a small sample instead of the entire dataset.…”

Section: Related Work (mentioning)

confidence: 99%

SPRISS: Approximating Frequent $k$-mers by Sampling Reads, and Applications

Santoro, Pellegrina, Vandin 2021

Preprint


The extraction of k-mers is a fundamental component in many complex analyses of large next-generation sequencing datasets, including read classification in genomics and the characterization of RNA-seq datasets. The extraction of all k-mers and their frequencies is extremely demanding in terms of running time and memory, owing to the size of the data and to the exponential number of k-mers to be considered. However, in several applications, only frequent k-mers, which are k-mers appearing in a relatively high proportion of the data, are required by the analysis. In this work we present SPRISS, a new efficient algorithm to approximate frequent k-mers and their frequencies in next-generation sequencing data. SPRISS employs a simple yet powerful read-sampling scheme, which makes it possible to extract a representative subset of the dataset that can be used, in combination with any k-mer counting algorithm, to perform downstream analyses in a fraction of the time required by the analysis of the whole data, while obtaining comparable answers. Our extensive experimental evaluation demonstrates the efficiency and accuracy of SPRISS in approximating frequent k-mers, and shows that it can be used in various scenarios, such as the comparison of metagenomic datasets and the identification of discriminative k-mers, to extract insights in a fraction of the time required by the analysis of the whole dataset.
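The general idea of estimating frequent k-mers from a sample of reads can be sketched as follows. This is a generic illustration under simple assumptions (uniform sampling without replacement, a plain frequency threshold); it is not the SPRISS sampling scheme and carries none of its guarantees. All names and parameters (`sample_reads`, `theta`, `seed`) are hypothetical.

```python
import random
from collections import Counter


def sample_reads(reads: list[str], m: int, seed: int = 0) -> list[str]:
    """Draw m reads uniformly at random without replacement (assumed scheme)."""
    return random.Random(seed).sample(reads, m)


def kmer_frequencies(reads: list[str], k: int) -> dict[str, float]:
    """Relative frequency of each k-mer among all k-mer occurrences in reads."""
    counts = Counter(r[i:i + k] for r in reads for i in range(len(r) - k + 1))
    total = sum(counts.values())
    return {kmer: c / total for kmer, c in counts.items()} if total else {}


def approx_frequent_kmers(
    reads: list[str], k: int, m: int, theta: float, seed: int = 0
) -> dict[str, float]:
    """Estimate the k-mers with frequency >= theta using only m sampled reads."""
    freqs = kmer_frequencies(sample_reads(reads, m, seed), k)
    return {kmer: f for kmer, f in freqs.items() if f >= theta}


if __name__ == "__main__":
    reads = ["ACGT", "ACGA", "TTTT"]
    # With m equal to the number of reads, the estimate equals the exact answer.
    print(approx_frequent_kmers(reads, k=2, m=3, theta=0.2))
```

Any exact k-mer counter could stand in for `kmer_frequencies`; the sampling step only reduces how much input that counter has to process.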

