Journaled string tree—a scalable data structure for analyzing thousands of similar genomes on your laptop

Rahn, René; Weese, David; Reinert, Knut

doi:10.1093/bioinformatics/btu438

Cited by 31 publications

(19 citation statements)

References 31 publications


“…It is distributed under the FreeBSD License and available at https://gitlab.com/rki_bioinformatics.

[11] | Multiple sequence mapping | reference sequence + variants | no | no | no | no | no
cdbg [2] | Graph construction | multiple reference sequences | external | yes | no | no | no
cdbg search [6] | Graph construction | multiple reference sequences | external | yes | no | no | no
GCSA [10] | Graph indexing, multiple sequence mapping | reference sequence + variants | no | no | no | no | no
GCSA2 [8] | Graph indexing | variation graph | no | no | no | no | no
GenomeMapper [12] | Multiple sequence mapping | reference sequence + variants | no | no | no | no | no
GenomeRing [3] | Pan-genome data structure | whole genome alignment | yes | yes | no | no | yes
JST [15] | Pan-genome data structure | reference sequence + variants | no | yes | yes | yes | yes
MHC-PRG [13] | Pan-genome data structure, multiple sequence variant detection | multiple sequence alignment AND variants | no | no | yes | no | no
PanCake [16] | Pan-genome data structure | multiple reference sequences AND pairwise alignment | external | yes | yes | no | no
panVC [14] | Multiple sequence variant detection | whole genome alignment | external | yes | no | no | yes
SplitMEM [43] | Graph construction | multiple reference sequences | external | yes | no | no | no
svaha [9] | Graph construction | reference sequence + variants | external | yes | no | no | no
TwoPaCo [7] | Graph construction | multiple reference sequences | external | yes | no | no | no
vg [17] | Pan-genome data structure | reference sequence + variants OR multiple reference sequences | external | yes | yes* | yes* | yes

Table 1 Comparison of pan-genome tools. We analyzed tools for pan-genome analysis that are available or currently under development.…”

Section: Declarations (mentioning)

confidence: 99%

“…Some ([10-12]) focus on subsequent analyses such as mapping reads to the pan-genome, while others ([13, 14]) improve variant detection by using a set of reference sequences instead of a single one. The final category in our collection is made up by tools that introduce a complete data structure and provide methods for the construction, storage, processing and visualization of the pan-genome ([3, 13, 15-17]). Most of these tools depend on information on the (dis-)similarity of genomes from a multiple genome alignment or a reference sequence with an adjoining corresponding set of variants to create a pan-genome.…”

Section: Introduction (mentioning)

confidence: 99%

“…While four of the analyzed tools (JST [15], MHC-PRG [13], PanCake [16], and vg [17]) provide methods for adding or removing genomes from the pan-genome data structure, only GenomeRing [3], JST [15], panVC [14] and vg [17] offer the ability to annotate biological features. This is often caused by the representation of the pan-genome as graphs, for which there is no standard method providing a coordinate system, which severely complicates the use of existing annotation databases and formats.…”

Section: Introduction (mentioning)

confidence: 99%


seq-seq-pan: Building a computational pan-genome data structure on whole genome alignment

Jandrasits, Dabrowski, Fuchs et al. 2017

Preprint

Christine Jandrasits, Piotr W Dabrowski, Stephan Fuchs and Bernhard Y Renard

Abstract

Background: The increasing application of next generation sequencing technologies has led to the availability of thousands of reference genomes, often providing multiple genomes for the same or closely related species. The current approach to represent a species or a population with a single reference sequence and a set of variations cannot represent their full diversity and introduces bias towards the chosen reference. There is a need for the representation of multiple sequences in a composite way that is compatible with existing data sources for annotation and suitable for established sequence analysis methods. At the same time, this representation needs to be easily accessible and extendable to account for the constant change of available genomes.

Results: We introduce seq-seq-pan, a framework that provides methods for adding or removing new genomes from a set of aligned genomes and uses these to construct a whole genome alignment. Throughout the sequential workflow the alignment is optimized for generating a representative linear presentation of the aligned set of genomes, that enables its usage for annotation and in downstream analyses.

Conclusions: By providing dynamic updates and optimized processing, our approach enables the usage of whole genome alignment in the field of pan-genomics. In addition, the sequential workflow can be used as a fast alternative to existing whole genome aligners. seq-seq-pan is freely available at https://gitlab.com/groups/rki_bioinformatics


“…Efficient data structures for prefix sum, rank, and select queries exist [98], which can be used for the purpose of doing projections to and from a sequence and its gapped version as a row of an MSA. Multiple sequence alignments can be compactly represented by journaled string trees [111]. This data structure also allows for efficiently executing sequential algorithms on all genomes in the MSA simultaneously.…”

Section: Approaches (mentioning)

confidence: 99%
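The projection between a sequence and its gapped version, mentioned in the statement above, can be illustrated with a plain prefix-sum array. This is only a minimal sketch: the class name `GappedRow` and its methods are hypothetical, and a succinct rank/select structure of the kind the statement cites would replace the uncompressed list used here.

```python
from bisect import bisect_right
from itertools import accumulate

GAP = "-"  # assumed gap symbol for an MSA row


class GappedRow:
    """One MSA row with rank/select-style coordinate projection (illustrative only)."""

    def __init__(self, row):
        self.row = row
        # prefix[i] = number of non-gap characters in row[:i]
        self.prefix = [0] + list(accumulate(1 if c != GAP else 0 for c in row))

    def rank(self, column):
        """Gapped -> ungapped: count of residues strictly before `column`."""
        return self.prefix[column]

    def select(self, k):
        """Ungapped -> gapped: alignment column holding the k-th residue (0-based)."""
        if not 0 <= k < self.prefix[-1]:
            raise IndexError("residue index out of range")
        # First index whose prefix count exceeds k, minus one, is the residue's column.
        return bisect_right(self.prefix, k) - 1


row = GappedRow("AC--GT-A")
column = row.select(2)          # column of the third residue
residue = row.row[column]       # the residue stored at that column
```

Here `rank` answers in constant time and `select` by binary search; `rank(select(k)) == k` holds for every valid residue index.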

Computational Pan-Genomics: Status, Promises and Challenges

Marschall, Marz, Abeel et al. 2016

Preprint

Many disciplines, from human genetics and oncology to plant breeding, microbiology and virology, commonly face the challenge of analyzing rapidly increasing numbers of genomes. In case of Homo sapiens, the number of sequenced genomes will approach hundreds of thousands in the next few years. Simply scaling up established bioinformatics pipelines will not be sufficient for leveraging the full potential of such rich genomic datasets. Instead, novel, qualitatively different computational methods and paradigms are needed. We will witness the rapid extension of computational pan-genomics, a new sub-area of research in computational biology. In this paper, we generalize existing definitions and understand a pan-genome as any collection of genomic sequences to be analyzed jointly or to be used as a reference. We examine already available approaches to construct and use pan-genomes, discuss the potential benefits of future technologies and methodologies, and review open challenges from the vantage point of the above-mentioned biological disciplines. As a prominent example for a computational paradigm shift, we particularly highlight the transition from the representation of reference genomes as strings to representations as graphs. We outline how this and other challenges from different application domains translate into common computational problems, point out relevant bioinformatics techniques and identify open problems in computer science. With this review, we aim to increase awareness that a joint approach to computational pan-genomics can help address many of the problems currently faced in various domains. * The Computational Pan-Genomics Consortium formed at a workshop held June 8-12, 2015, at the Lorentz Center in Leiden, the Netherlands, with the purpose of providing a cross-disciplinary overview of the emerging discipline of Computational Pan-Genomics. Members are listed at the end of this article.


“…Several authors have shown that human genomes can be compactly represented and queried using reference-based compression (Layer et al, 2015; Purcell et al, 2007; Christley et al, 2009; Glusman et al, 2011; Grabowski, 2013, 2011; Kelleher et al, 2013; Wittelsbuerger et al, 2014; Rahn et al, 2014; Durbin, 2014; Fritz et al, 2011). Tiling allows for a compact representation and enables fast queries without decompression.…”

Section: Discussion (mentioning)

confidence: 99%

Untitled

Supplemental Information 10: Figure S2: Projection of 178 PGP Whole Genome Sequences Along Their First Two Principal Components


The scientific and medical community is reaching an era of inexpensive whole genome sequencing, opening the possibility of precision medicine for millions of individuals. Here we present tiling: a flexible representation of whole genome sequences that supports simple and consistent names, annotation, queries, machine learning, and clinical screening. We partitioned the genome into 10,655,006 tiles: overlapping, variable-length sequences that begin and end with unique 24-base tags. We tiled and annotated 680 public whole genome sequences from the 1000 Genomes Project Consortium (1KG) and Harvard Personal Genome Project (PGP) using ClinVar database information. These genomes cover 14.13 billion tile sequences (4.087 trillion high quality bases and 0.4321 trillion low quality bases) and 251 phenotypes spanning ICD-9 code ranges 140-289, 320-629, and 680-759. We used these data to build a Global Alliance for Genomics and Health Beacon and graph database. We performed principal component analysis (PCA) on the 680 public whole genomes, and by projecting the tiled genomes onto their first two principal components, we replicated the 1KG principal component separation by population ethnicity codes. Interestingly, we found that the PGP self-reported ethnicities cluster consistently with 1KG ethnicity codes. We built a set of support-vector ABO blood-type classifiers using 75 PGP participants who had both a whole genome sequence and a self-reported blood type. Our classifier predicts A antigen presence to within 1% of the current state of the art for in silico A antigen prediction. Finally, we found six PGP participants with previously undiscovered pathogenic BRCA variants, and using our tiling, gave them simple, consistent names, which can be easily and independently re-derived.
Given the near-future requirements of genomics research and precision medicine, we propose the adoption of tiling and invite all interested individuals and groups to view, rerun, copy, and modify these analyses at https://curover.se/su92l-j7d0g-swtofxa2rct8495. RESULTS: All results described here may be found, replicated, and rerun on different data using Arvados at


Abstract: In this work, we provide a datatype that exploits data parallelism inherent in a set of similar sequences by analyzing shared regions only once. In real-world experiments, we show that algorithms that otherwise would scan each reference sequentially can be sped up by a factor of 115.
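The idea of analyzing shared regions only once can be illustrated with a deliberately simplified sketch. It is not the journaled string tree itself: it assumes every sequence differs from one reference only by substitutions (no insertions or deletions), and the function names `find_all` and `search_snp_set` are hypothetical. The reference is scanned a single time, and only windows that overlap a variant are re-checked per sequence.

```python
def find_all(text, pattern):
    """Return every start position of `pattern` in `text` (overlaps allowed)."""
    hits, i = [], text.find(pattern)
    while i != -1:
        hits.append(i)
        i = text.find(pattern, i + 1)
    return hits


def search_snp_set(reference, snps_per_seq, pattern):
    """Exact pattern search over sequences stored as reference + substitutions.

    snps_per_seq: one dict {position: base} per sequence (assumed input format).
    Returns one sorted list of hit positions per sequence.
    """
    m, n = len(pattern), len(reference)
    ref_hits = find_all(reference, pattern)  # shared work, done once
    results = []
    for snps in snps_per_seq:
        # Window start positions whose window covers at least one substitution.
        touched = set()
        for pos in snps:
            touched.update(range(max(0, pos - m + 1), min(pos, n - m) + 1))
        # Hits in untouched windows carry over from the reference unchanged.
        hits = [h for h in ref_hits if h not in touched]
        # Touched windows are re-examined with the substitutions applied.
        for start in touched:
            window = "".join(snps.get(i, reference[i]) for i in range(start, start + m))
            if window == pattern:
                hits.append(start)
        results.append(sorted(hits))
    return results


reference = "ACGTACGTAC"
variants = [{}, {3: "A"}, {0: "G", 1: "T", 2: "A"}]
hits = search_snp_set(reference, variants, "GTA")
```

The per-sequence work depends on the number of variants rather than on the sequence length, which is the intuition behind the abstract's claim; handling indels and sharing work between sequences that carry the same variant requires the more elaborate structure the paper describes.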
