2020
DOI: 10.1101/2020.10.01.322164
Preprint

Indexing All Life’s Known Biological Sequences

Abstract: The amount of biological sequencing data available in public repositories is growing exponentially, forming an invaluable biomedical research resource. Yet, making all this sequencing data searchable and easily accessible to life science and data science researchers is an unsolved problem. We present MetaGraph, a versatile framework for the scalable analysis of extensive sequence repositories. MetaGraph efficiently indexes vast collections of sequences to enable fast search and comprehensive analysis. A wide r…

Cited by 52 publications (130 citation statements)
References 121 publications (299 reference statements)
“…Contemporary of kmtricks, the MetaGraph software (Karasikov et al, 2020) is a k-mer indexing structure that represents k-mers exactly (i.e. not using a Bloom filter) and does not support creating k-mer matrices.…”
Section: Discussion (mentioning)
confidence: 99%
“…To this end, the de Bruijn graph has become an object of central importance in many genomic analysis tasks. While it was initially used mostly in the context of genome (and transcriptome) assembly (EULER [42], Velvet [51,52], ALLPATHS [9,30], EULER-SR [10], ABySS [46], SOAPdenovo [25,29], Trans-AByss [43], SPAdes [5], Minia [13]), it has seen increasing use in comparative genomics (Cortex [19], DISCOSNP [50], Scalpel [15], BubbZ [34]) and has also been used increasingly in the context of indexing genomic data, either from raw sequencing reads (Mantis [40,1], Vari [37], VariMerge [36], MetaGraph [20]), or from assembled reference sequences (deBGA [27], Pufferfish [2], deSALT [28]), or from both (BLight [32], Bifrost [17]). These latter applications most frequently make use of the (colored) compacted de Bruijn graph, a variant of the de Bruijn graph in which maximal non-branching paths (unitigs) are condensed into single vertices in the underlying graph structure.…”
Section: Introduction (mentioning)
confidence: 99%
“…To this end, the de Bruijn graph has become an object of central importance in many genomic analysis tasks. While it was initially used mostly in the context of genome (and transcriptome) assembly (EULER (Pevzner et al, 2001), EULER-SR (Chaisson and Pevzner, 2008), Velvet (Zerbino and Birney, 2008; Zerbino et al, 2009), ALLPATHS (Butler et al, 2008; MacCallum et al, 2009), ABySS (Simpson et al, 2009), Trans-AByss (Robertson et al, 2010), SPAdes (Bankevich et al, 2012), Minia (Chikhi and Rizk, 2013), SOAPdenovo (Li et al, 2010; Luo et al, 2015)), it has seen increasing use in comparative genomics (Cortex (Iqbal et al, 2012), DISCOSNP (Uricaru et al, 2014), Scalpel (Fang et al, 2016), BubbZ (Minkin and Medvedev, 2020)), and has also been used increasingly in the context of indexing genomic data, either from raw sequencing reads (Vari (Muggli et al, 2017), Mantis (Pandey et al, 2018; Almodaresi et al, 2019), VariMerge (Muggli et al, 2019), MetaGraph (Karasikov et al, 2020)), or from assembled reference sequences (deBGA (Liu et al, 2016), Pufferfish (Almodaresi et al, 2018), deSALT (Liu et al, 2019)), or from both (BLight (Marchet et al, 2019), Bifrost (Holley and Melsted, 2020)). These latter applications most frequently make use of the (colored) compacted de Bruijn graph, a variant of the de Bruijn graph in which the maximal non-branching paths (also referred to as unitigs) are condensed into single vertices in the underlying graph structure.…”
Section: Introduction (mentioning)
confidence: 99%
“…Longer strings are queried as a succession of k-mers. Although it is a lossy representation of the input (as, e.g., repeats longer than k are collapsed), constructing k-mer sets has proved highly useful in practice [3, 4, 5, 6].…”
Section: Introduction (mentioning)
confidence: 99%
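
The last statement above describes querying a long string as a succession of k-mers against a k-mer set, and notes that the set is a lossy representation of the input. A minimal Python sketch of that idea follows. It is an illustration only, not MetaGraph's or any cited tool's implementation; the function names (`kmers`, `build_index`, `query`) and the toy sequences are assumptions made for this example.

```python
def kmers(seq, k):
    """Return the consecutive k-mers of seq, in order (hypothetical helper)."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def build_index(sequences, k):
    """Collect every k-mer of every input sequence into a plain set.

    Order and multiplicity are discarded, which is what makes the
    representation lossy.
    """
    index = set()
    for s in sequences:
        index.update(kmers(s, k))
    return index


def query(index, seq, k):
    """Return the fraction of the query's k-mers that are in the index."""
    parts = kmers(seq, k)
    if not parts:  # query shorter than k: nothing to look up
        return 0.0
    return sum(p in index for p in parts) / len(parts)


# Toy data, assumed for illustration.
index = build_index(["ACGTA", "GTACC"], k=3)

# Every 3-mer of this query occurs in the index, although the string
# itself was never one of the inputs.
full_match = query(index, "ACGTACC", k=3)

# A repeat collapses: "ATATAT" contributes only two distinct 2-mers.
repeat_index = build_index(["ATATAT"], k=2)
```

Here `full_match` equals 1.0 even though "ACGTACC" is not among the indexed sequences, and `repeat_index` holds only "AT" and "TA". Both effects follow from storing k-mers as an unordered set.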