2014
DOI: 10.1093/bioinformatics/btu438

Journaled string tree—a scalable data structure for analyzing thousands of similar genomes on your laptop

Abstract: In this work, we provide a datatype that exploits data parallelism inherent in a set of similar sequences by analyzing shared regions only once. In real-world experiments, we show that algorithms that otherwise would scan each reference sequentially can be speeded up by a factor of 115.

citations
Cited by 31 publications
(19 citation statements)
references
References 31 publications
“…It is distributed under the FreeBSD License and available at https://gitlab.com/rki_bioinformatics.
[11] | Multiple sequence mapping | reference sequence + variants | no | no | no | no | no
cdbg [2] | Graph construction | multiple reference sequences | external | yes | no | no | no
cdbg search [6] | Graph construction | multiple reference sequences | external | yes | no | no | no
GCSA [10] | Graph indexing, Multiple sequence mapping | reference sequence + variants | no | no | no | no | no
GCSA2 [8] | Graph indexing | variation graph | no | no | no | no | no
GenomeMapper [12] | Multiple sequence mapping | reference sequence + variants | no | no | no | no | no
GenomeRing [3] | Pan-genome data structure | whole genome alignment | yes | yes | no | no | yes
JST [15] | Pan-genome data structure | reference sequence + variants | no | yes | yes | yes | yes
MHC-PRG [13] | Pan-genome data structure, Multiple sequence variant detection | multiple sequence alignment AND variants | no | no | yes | no | no
PanCake [16] | Pan-genome data structure | multiple reference sequences AND pairwise alignment | external | yes | yes | no | no
panVC [14] | Multiple sequence variant detection | whole genome alignment | external | yes | no | no | yes
SplitMEM [43] | Graph construction | multiple reference sequences | external | yes | no | no | no
svaha [9] | Graph construction | reference sequence + variants | external | yes | no | no | no
TwoPaCo [7] | Graph construction | multiple reference sequences | external | yes | no | no | no
vg [17] | Pan-genome data structure | reference sequence + variants OR multiple reference sequences | external | yes | yes* | yes* | yes
Table 1 Comparison of pan-genome tools. We analyzed tools for pan-genome analysis that are available or currently under development.…”
Section: Declarations (mentioning)
confidence: 99%
“…Some ([10–12]) focus on subsequent analyses such as mapping reads to the pan-genome, while others ([13, 14]) improve variant detection by using a set of reference sequences instead of a single one. The final category in our collection is made up by tools that introduce a complete data structure and provide methods for the construction, storage, processing and visualization of the pan-genome ([3, 13, 15–17]). Most of these tools depend on information on the (dis-)similarity of genomes from a multiple genome alignment or a reference sequence with an adjoining corresponding set of variants to create a pan-genome.…”
Section: Introduction (mentioning)
confidence: 99%
“…Efficient data structures for prefix sum, rank, and select queries exist [98], which can be used for the purpose of doing projections to and from a sequence and its gapped version as a row of an MSA. Multiple sequence alignments can be compactly represented by journaled string trees [111]. This data structure also allows for efficiently executing sequential algorithms on all genomes in the MSA simultaneously.…”
Section: Approaches (mentioning)
confidence: 99%
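
The projection described in the statement above, between a position in an ungapped sequence and its column in the gapped MSA row, can be illustrated with a minimal sketch. It assumes a plain prefix-sum array and binary search rather than the succinct structures the citing paper refers to, and the class name `GappedRow` and its methods are hypothetical names chosen for illustration only.

```python
from bisect import bisect_left
from itertools import accumulate


class GappedRow:
    """One MSA row with rank/select queries over its non-gap columns.

    Illustrative only: uses an O(n) prefix-sum array, not a succinct bitvector.
    """

    def __init__(self, row, gap="-"):
        self.row = row
        # prefix[i] = number of non-gap characters in row[:i]
        self.prefix = [0] + list(accumulate(int(c != gap) for c in row))

    def rank(self, col):
        """Number of residues strictly before MSA column `col`
        (i.e. the sequence position that column `col` projects to)."""
        return self.prefix[col]

    def select(self, k):
        """MSA column that holds the k-th residue (0-based) of the sequence."""
        if not 0 <= k < self.prefix[-1]:
            raise IndexError("residue index out of range")
        # First i with prefix[i] >= k + 1; the residue sits in column i - 1.
        return bisect_left(self.prefix, k + 1) - 1


row = GappedRow("AC--GT-A")
col = row.select(2)   # column of the third residue of "ACGTA"
pos = row.rank(col)   # projects back to sequence position 2
```

Here `select` maps a sequence position to its alignment column and `rank` maps a column back to a sequence position, so the two calls round-trip for any non-gap column.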
“…Several authors have shown that human genomes can be compactly represented and queried using reference-based compression (Layer et al, 2015;Purcell et al, 2007;Christley et al, 2009;Glusman et al, 2011;Grabowski, 2013, 2011;Kelleher et al, 2013;Wittelsbuerger et al, 2014;Rahn et al, 2014;Durbin, 2014;Fritz et al, 2011). Tiling allows for a compact representation and enables fast queries without decompression.…”
Section: Discussion (mentioning)
confidence: 99%
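
As a rough illustration of the reference-based representation mentioned in the statement above, the sketch below stores a genome as a shared reference plus a short list of edits and rebuilds the full sequence on demand. It is not the method of any of the cited tools. The function name `apply_variants` and the `(position, deleted_length, inserted_text)` edit format are assumptions made for this example.

```python
def apply_variants(reference, variants):
    """Rebuild a sequence from a reference and a list of edits.

    `variants` is assumed to be sorted by position and non-overlapping.
    Each edit is (position, deleted_length, inserted_text) in reference
    coordinates: a SNP is (p, 1, "X"), a deletion (p, n, ""), an
    insertion (p, 0, "XYZ").
    """
    pieces = []
    cursor = 0  # next reference position not yet copied
    for pos, del_len, ins in variants:
        if pos < cursor:
            raise ValueError("variants must be sorted and non-overlapping")
        pieces.append(reference[cursor:pos])  # shared region, copied as-is
        pieces.append(ins)                    # sequence-specific content
        cursor = pos + del_len                # skip the replaced reference bases
    pieces.append(reference[cursor:])         # remaining shared suffix
    return "".join(pieces)


reference = "ACGTACGT"
sample = apply_variants(reference, [(2, 1, "T"), (5, 2, ""), (8, 0, "AA")])
```

Only the edit list needs to be stored per genome, which is why a collection of highly similar sequences compresses well under this scheme.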