Indexing All Life’s Known Biological Sequences

Karasikov, Mikhail; Mustafa, Harun; Danciu, Daniel; Zimmermann, Marc; Barber, Christopher A.; Rätsch, Gunnar; Kahles, André

doi:10.1101/2020.10.01.322164

Cited by 52 publications

(130 citation statements)

References 121 publications

(299 reference statements)

Supporting

Mentioning

108

Contrasting

“…Contemporary of kmtricks, the MetaGraph software (Karasikov et al, 2020) is a k-mer indexing structure that represents k-mers exactly (i.e. not using a Bloom filter) and does not support creating k-mer matrices.…”

Section: Discussion (mentioning)

confidence: 99%

kmtricks: Efficient and flexible construction of Bloom filters for large sequencing data collections

Lemane

Medvedev

Chikhi

et al. 2021

Preprint

When indexing large collection of sequencing data, a common operation that has now been implemented in several tools (Sequence Bloom Trees and variants, BIGSI, ..) is to construct a collection of Bloom filters, one per sample. Each Bloom filter is used to represent a set of k-mers which approximates the desired set of all the non-erroneous k-mers present in the sample. However, this approximation is imperfect, especially in the case of metagenomics data. Erroneous but abundant k-mers are wrongly included, and non-erroneous but low-abundant ones are wrongly discarded. We propose kmtricks, a novel approach for generating Bloom filters from terabase-sized collections of sequencing data. Our main contributions are 1/ an efficient method for jointly counting k-mers across multiple samples, including a streamlined Bloom filter construction by directly counting hashes instead of k-mers; 2/ a novel technique that takes advantage of joint counting to preserve low-abundant k-mers present in several samples, improving the recovery of non-erroneous k-mers. In addition, our experimental results highlight that the usual yet crude filtering of low-abundant k-mers is inappropriate for complex data such as metagenomes.
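
As an illustration of the baseline the abstract describes (one Bloom filter per sample, built after discarding low-abundance k-mers), here is a minimal Python sketch. It is not the kmtricks method or API: the `BloomFilter` class, the `build_sample_filter` helper, the hashing scheme, and all parameter values are assumptions chosen for readability.

```python
import hashlib
from collections import Counter


def kmers(seq, k):
    """Yield every length-k substring of seq."""
    return (seq[i:i + k] for i in range(len(seq) - k + 1))


class BloomFilter:
    """Toy Bloom filter (hypothetical helper, not taken from any tool)."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Derive num_hashes bit positions by salting one hash function.
        for seed in range(self.num_hashes):
            digest = hashlib.blake2b(
                item.encode(), digest_size=8, salt=seed.to_bytes(8, "little")
            ).digest()
            yield int.from_bytes(digest, "little") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item)
        )


def build_sample_filter(reads, k, min_abundance, num_bits=1 << 16, num_hashes=3):
    """Count k-mers in one sample and insert those seen >= min_abundance times."""
    counts = Counter(km for read in reads for km in kmers(read, k))
    bloom = BloomFilter(num_bits, num_hashes)
    for km, count in counts.items():
        if count >= min_abundance:
            bloom.add(km)
    return bloom


sample_reads = ["ACGTACGT", "ACGTAC", "TTTTGA"]
sample_filter = build_sample_filter(sample_reads, k=4, min_abundance=2)
print("ACGT" in sample_filter)  # inserted k-mers are always reported as present
```

The per-sample threshold in this sketch drops every k-mer seen only once in that sample, regardless of whether it recurs in other samples. That is the crude filtering the abstract argues against for complex data such as metagenomes.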

“…To this end, the de Bruijn graph has become an object of central importance in many genomic analysis tasks. While it was initially used mostly in the context of genome (and transcriptome) assembly (EULER [42], Velvet [51,52], ALLPATHS [9,30], EULER-SR [10], ABySS [46], SOAPdenovo [25,29], Trans-AByss [43], SPAdes [5], Minia [13]), it has seen increasing use in comparative genomics (Cortex [19], DISCOSNP [50], Scalpel [15], BubbZ [34]) and has also been used increasingly in the context of indexing genomic data, either from raw sequencing reads (Mantis [40,1], Vari [37], VariMerge [36], MetaGraph [20]), or from assembled reference sequences (deBGA [27], Pufferfish [2], deSALT [28]), or from both (BLight [32], Bifrost [17]). These latter applications most frequently make use of the (colored) compacted de Bruijn graph, a variant of the de Bruijn graph in which maximal non-branching paths (unitigs) are condensed into single vertices in the underlying graph structure.…”

Section: Introduction (mentioning)

confidence: 99%

“…To this end, the de Bruijn graph has become an object of central importance in many genomic analysis tasks. While it was initially used mostly in the context of genome (and transcriptome) assembly (EULER (Pevzner et al, 2001), EULER-SR (Chaisson and Pevzner, 2008), Velvet (Zerbino and Birney, 2008; Zerbino et al, 2009), ALLPATHS (Butler et al, 2008; MacCallum et al, 2009), ABySS (Simpson et al, 2009), Trans-AByss (Robertson et al, 2010), SPAdes (Bankevich et al, 2012), Minia (Chikhi and Rizk, 2013), SOAPdenovo (Li et al, 2010; Luo et al, 2015)), it has seen increasing use in comparative genomics (Cortex (Iqbal et al, 2012), DISCOSNP (Uricaru et al, 2014), Scalpel (Fang et al, 2016), BubbZ (Minkin and Medvedev, 2020)), and has also been used increasingly in the context of indexing genomic data, either from raw sequencing reads (Vari (Muggli et al, 2017), Mantis (Pandey et al, 2018; Almodaresi et al, 2019), VariMerge (Muggli et al, 2019), MetaGraph (Karasikov et al, 2020)), or from assembled reference sequences (deBGA (Liu et al, 2016), Pufferfish (Almodaresi et al, 2018), deSALT (Liu et al, 2019)), or from both (BLight (Marchet et al, 2019), Bifrost (Holley and Melsted, 2020)). These latter applications most frequently make use of the (colored) compacted de Bruijn graph, a variant of the de Bruijn graph in which the maximal non-branching paths (also referred to as unitigs) are condensed into single vertices in the underlying graph structure.…”

Section: Introduction (mentioning)

confidence: 99%

Cuttlefish: Fast, parallel, and low-memory compaction of de Bruijn graphs from large-scale genome collections

Khan

Patro

2020

Preprint

Motivation: The construction of the compacted de Bruijn graph from a large collection of reference genomes is a task of increasing interest in genomic analyses. For example, compacted colored reference de Bruijn graphs are increasingly used as sequence indices for the purposes of alignment of short and long reads. Also, as we sequence and assemble a greater diversity of individual genomes, the compacted colored de Bruijn graph can be used as the basis for methods aiming to perform comparative genomic analyses on these genomes. While algorithms have been developed to construct the compacted colored de Bruijn graph from reference sequences, there is still room for improvement, especially in the memory and the runtime performance as the number and the scale of the genomes over which the de Bruijn graph is built grow. Results: We introduce a new algorithm, implemented in the tool Cuttlefish, to construct the colored compacted de Bruijn graph from a collection of one or more genome references. Cuttlefish introduces a novel modeling scheme of the de Bruijn graph vertices as finite-state automata, and constrains the state-space for the automata to enable tracking of their transitioning states with very low memory usage. Cuttlefish is also fast and highly parallelizable. Experimental results demonstrate that the algorithm scales much better than existing approaches, especially as the number and scale of the input references grow. For example, on a typical shared-memory machine, Cuttlefish constructed the compacted graph for 100 human genomes in less than 7 hours, using ~29 GB of memory; no other tested tool successfully completed this task on the testing hardware. We also applied Cuttlefish on 11 diverse conifer plant genomes, and the compacted graph was constructed in under 11 hours, using ~84 GB of memory, while the only other tested tool able to complete this compaction on our hardware took more than 16 hours and ~289 GB of memory. 
Availability: Cuttlefish is written in C++14, and is available under an open source license at https://github.com/COMBINE-lab/cuttlefish.

“…Longer strings are queried as a succession of k-mers. Although it is a lossy representation of the input (as, e.g., repeats longer than k are collapsed), constructing k-mer sets has proved highly useful in practice [3, 4, 5, 6].…”

Section: Introduction (mentioning)

confidence: 99%
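
The statement above, that longer strings are queried as a succession of k-mers and that a k-mer set is a lossy representation, can be sketched in a few lines of Python. This is only an illustration using a plain in-memory set; `build_kmer_set` and `query_fraction` are hypothetical names and do not correspond to the interface of MetaGraph or any other tool.

```python
def kmers(seq, k):
    """Return every length-k substring of seq, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def build_kmer_set(sequences, k):
    """Index a collection of sequences as the set of all their k-mers."""
    return {km for seq in sequences for km in kmers(seq, k)}


def query_fraction(kmer_set, query, k):
    """Query a longer string k-mer by k-mer; return the fraction found."""
    query_kmers = kmers(query, k)
    if not query_kmers:
        return 0.0
    hits = sum(km in kmer_set for km in query_kmers)
    return hits / len(query_kmers)


index = build_kmer_set(["ACGT", "CGTA"], k=3)
print(query_fraction(index, "ACGTA", k=3))  # all three k-mers are indexed
print(query_fraction(index, "ACGA", k=3))   # ACG is indexed, CGA is not
```

The first query shows the lossiness: "ACGTA" is not a substring of either input sequence, yet every one of its 3-mers is in the set, so the query cannot distinguish it from a sequence that was actually indexed.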

Topology-based Sparsification of Graph Annotations

Danciu

Karasikov

Mustafa

et al. 2020

Preprint

Self Cite

Since the amount of published biological sequencing data is growing exponentially, efficient methods for storing and indexing this data are more needed than ever to truly benefit from this invaluable resource for biomedical research. Labeled de Bruijn graphs are a frequently-used approach for representing large sets of sequencing data. While significant progress has been made to succinctly represent the graph itself, efficient methods for storing labels on such graphs are still rapidly evolving. In this paper, we present RowDiff, a new technique for compacting graph labels by leveraging expected similarities in annotations of nodes adjacent in the graph. RowDiff can be constructed in linear time relative to the number of nodes and labels in the graph, and the construction can be efficiently parallelized and distributed, significantly reducing construction time. RowDiff can be viewed as an intermediary sparsification step of the initial annotation matrix and can thus naturally be combined with existing generic schemes for compressed binary matrix representation. Our experiments on the Fungi subset of the RefSeq collection show that applying RowDiff sparsification reduces the size of individual annotation columns stored as compressed bit vectors by an average factor of 42. When combining RowDiff with a Multi-BRWT representation, the resulting annotation is 26 times smaller than Mantis-MST, the previously known smallest annotation representation. In addition, experiments on 10,000 RNA-seq datasets show that RowDiff combined with Multi-BRWT results in a 30% reduction in annotation footprint over Mantis-MST.
