Indexes of Large Genome Collections on a PC

Danek, Agnieszka; Deorowicz, Sebastian; Grabowski, Szymon

doi:10.1371/journal.pone.0109384

Cited by 38 publications

(37 citation statements)

References 34 publications


“…Past efforts that evaluated graph aligners have been selective about what variants to include in the graph, but without a clear rationale. Some included all variants from a defined subset of strains or haplotypes [6,23,27] or from a database such as the 1000 Genomes Project callset [2] or dbSNP [32]. In some cases, variants were filtered according to ethnicity, e.g.…”

Section: Variant Selection and Evaluation (mentioning)

confidence: 99%


FORGe: prioritizing variants for graph genomes

Pritt, Langmead (2018). Preprint.


There is growing interest in using genetic variants to augment the reference genome into a "graph genome" to improve read alignment accuracy and reduce allelic bias. While adding a variant has the positive effect of removing an undesirable alignment-score penalty, it also increases both the ambiguity of the reference genome and the cost of storing and querying the genome index. We introduce methods and a software tool called FORGe for modeling these effects and prioritizing variants accordingly. We show that FORGe enables a range of advantageous and measurable trade-offs between accuracy and computational overhead. Adding more variation to the reference eventually reduces alignment accuracy. We suggest efficient models for scoring variants according to the effect on accuracy and "blowup" (computational overhead), and further show that these scores can be used to achieve a balance of accuracy and overhead superior to current approaches. For example, extrapolating to a whole-human DNA sequencing experiment at 40-fold average coverage, we estimate that a well-engineered augmented reference can yield about 4.8M more correctly aligned reads and 1.2M fewer incorrectly aligned reads compared to the linear reference. Our methods for selecting variants also reduce reference bias, a chief goal of graph genomes. Finally, we compare the accuracy yielded by our methods to that achieved using an ideal personalized graph genome. We show that our methods approach the ideal much more closely than both linear genomes (even when they are modified to contain only major alleles) and graph genomes built on different sets of variants. These methods are implemented in a new open source software tool called FORGe. We demonstrate FORGe in conjunction with the HISAT2 [12] graph aligner and with another aligner based on the Enhanced Reference Genome [7]. But FORGe's models and methods are suitable for any aligner that can include variants in the reference.


“…GCSA2 [10] indexes paths in arbitrary graphs and is implemented in the VG software tool [11] which can align reads to such indexes. MuGI [27] and GraphTyper [21] use k-mer-based indexes.…”

Section: Introduction (mentioning)

confidence: 99%

FORGe: prioritizing variants for graph genomes

Pritt, Langmead (2018). Preprint.


“…Most existing indexing schemes for sequence graphs attempt to index k-mers in the graph, and they can broadly be categorized as being either hashing-based or BWT-based. The first hashing-based approach was introduced by Schneeberger et al (2009), and several related approaches based on hashing k-mers have been put forward since then (Danek et al, 2014; Limasset et al, 2016; Eggertsson et al, 2017; Petrov et al, 2018).…”

Section: Related Work (mentioning)

confidence: 99%

Fully-sensitive Seed Finding in Sequence Graphs Using a Hybrid Index

Ghaffaari, Marschall (2019). Preprint.


Motivation: Sequence graphs are versatile data structures that are, for instance, able to represent the genetic variation found in a population and to facilitate genome assembly. Read mapping to sequence graphs constitutes an important step for many applications and is usually done by first finding exact seed matches, which are then extended by alignment. Existing methods for finding seed hits prune the graph in complex regions, leading to a loss of information, especially in highly polymorphic regions of the genome. While such complex graph structures can indeed lead to a combinatorial explosion of possible alleles, the query set of reads from a diploid individual realizes only two alleles per locus, a property that is not exploited by extant methods. Results: We present the Pan-genome Seed Index (PSI), a fully-sensitive hybrid method for seed finding, which takes full advantage of this property by combining an index over selected paths in the graph with an index over the query reads. This enables PSI to find all seeds while eliminating the need to prune the graph. We demonstrate its performance with different parameter settings both on simulated data and on a whole human genome graph constructed from variants in the 1000 Genomes Project data set. On this graph, PSI outperforms GCSA2 in terms of index size, query time, and sensitivity. Availability: The C++ implementation is publicly available at: https://github.com/cartoonist/psi.


“…After finding seed occurrences for a read in this graph, the alignment was refined locally using dynamic programming. Similar k-mer indexing on sequence graphs has since been used and extended in several read mapping tools such as MuGI [128], BGREAT [129] and VG⁸.…”

Section: Read Mapping (mentioning)

confidence: 99%

Computational Pan-Genomics: Status, Promises and Challenges

Marschall, Marz, Abeel et al. (2016). Preprint.


Many disciplines, from human genetics and oncology to plant breeding, microbiology and virology, commonly face the challenge of analyzing rapidly increasing numbers of genomes. In the case of Homo sapiens, the number of sequenced genomes will approach hundreds of thousands in the next few years. Simply scaling up established bioinformatics pipelines will not be sufficient for leveraging the full potential of such rich genomic datasets. Instead, novel, qualitatively different computational methods and paradigms are needed. We will witness the rapid extension of computational pan-genomics, a new sub-area of research in computational biology. In this paper, we generalize existing definitions and understand a pan-genome as any collection of genomic sequences to be analyzed jointly or to be used as a reference. We examine already available approaches to construct and use pan-genomes, discuss the potential benefits of future technologies and methodologies, and review open challenges from the vantage point of the above-mentioned biological disciplines. As a prominent example of a computational paradigm shift, we particularly highlight the transition from the representation of reference genomes as strings to representations as graphs. We outline how this and other challenges from different application domains translate into common computational problems, point out relevant bioinformatics techniques and identify open problems in computer science. With this review, we aim to increase awareness that a joint approach to computational pan-genomics can help address many of the problems currently faced in various domains.

* The Computational Pan-Genomics Consortium formed at a workshop held June 8-12, 2015, at the Lorentz Center in Leiden, the Netherlands, with the purpose of providing a cross-disciplinary overview of the emerging discipline of Computational Pan-Genomics. Members are listed at the end of this article.
