FORGe: prioritizing variants for graph genomes

Pritt, Jacob; Chen, Nae-Chyun; Langmead, Ben

doi:10.1186/s13059-018-1595-x

Cited by 75 publications

(59 citation statements)

References 51 publications


“…Full; graph encoding all 1000G variation in chromosome 6 (excluding NA12878), Min2; graph encoding only variations that were observed in at least two individuals; PopCov10+; graph encoding the top 10% scoring variations as scored by FORGe [30], which weighs variants by allele frequency in the population and minimizes graph complexity. Figure 4a shows the fractions of reads that are correctly and incorrectly aligned onto the different reference genomes.…”

mentioning

confidence: 99%

CHOP: Haplotype-aware path indexing in population graphs

Mokveld, Linthorst, Al-Ars, et al. 2018

Preprint


The practical use of graph-based reference genomes depends on the ability to align reads to them. Performing substring queries to paths through these graphs lies at the core of this task. The combination of increasing pattern length and encoded variations inevitably leads to a combinatorial explosion of the search space. We propose CHOP, a method that uses haplotype information to prevent this from happening. We show that CHOP can be applied to large and complex datasets, by applying it on a graph-based representation of the human genome encoding all 80 million variants reported by the 1000 Genomes project.

Pangenomes and their graphical representations have become widespread in the domain of sequencing analysis [1]. Part of this adoption is driven by the increased characterization of within-species genomic diversity. For instance, recent versions of the human reference genome (GRCh37 and up) include sequences that represent highly polymorphic regions in the human population [2].

A pangenome can be constructed by integrating known variants in the linear reference genome. This way, a pangenome can incorporate sequence diversity in ways that a typical linear reference genome cannot. For example, aligning reads to a linear reference genome can lead to an over-representation of the reference allele. This effect, known as reference allele bias, influences highly polymorphic regions and/or regions that are absent from the reference [3,4]. By integrating variants into the alignment process, this bias can be reduced [5][6][7]. As a consequence, variant calling can be improved, with fewer erroneous variants induced by misalignments around indels, and fewer missed variants [8]. Graph data structures, often referred to as population graphs [1,9], are an intuitive representation for pangenomes. Population graphs can be understood as compressed representations of multiple genomes, with sequence generally represented on the nodes. These nodes are in turn connected by directed edges, such that the full sequence of any genome used to construct the graph can be determined by a specific path traversal through the graph. Alternatively, an arbitrary traversal through the graph will yield a mixture of genomes.

A key application for reference genomes is read alignment. Most of the linear reference read aligners follow a seed-and-extend paradigm, wherein exact matching substrings (seeds) between the read and a reference are used to constrain a local alignment. To efficiently search for exactly matching substrings (seeding), indexing data structures are used. The construction of these indexes generally relies on one of two methods: k-mer-based indexing, where all substrings of length k are stored in a hash-map along with their positions within the sequence; and sorting-based methods such as the Burrows-Wheeler Transform (BWT), where the reference sequence is transformed into a self-index that supports the lookup of exact-matching substrings of arbitrary length. Existi...
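The k-mer-based indexing mentioned in this abstract can be illustrated with a minimal Python sketch. It is not taken from CHOP or from any aligner; the function names `build_kmer_index` and `find_seeds`, the toy reference, and the choice of k = 4 are all assumptions made for illustration, and the sketch covers only a linear reference, not a graph.

```python
from collections import defaultdict


def build_kmer_index(reference: str, k: int) -> dict[str, list[int]]:
    """Map every length-k substring of `reference` to its 0-based start positions.

    Hypothetical helper for illustration; real aligners use far more compact indexes.
    """
    index = defaultdict(list)
    for start in range(len(reference) - k + 1):
        index[reference[start:start + k]].append(start)
    return dict(index)


def find_seeds(read: str, index: dict[str, list[int]], k: int) -> list[tuple[int, int]]:
    """Return (read_offset, reference_position) pairs for exact k-mer matches."""
    seeds = []
    for offset in range(len(read) - k + 1):
        for position in index.get(read[offset:offset + k], []):
            seeds.append((offset, position))
    return seeds


# Toy data (assumed, not from the source).
reference = "ACGTACGTTAGC"
index = build_kmer_index(reference, k=4)
seeds = find_seeds("CGTTAG", index, k=4)
```

Each seed pairs an offset in the read with a position in the reference; in a seed-and-extend aligner, such pairs would then constrain where local alignment is attempted.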



“…Though the default scoring functions of tools like BWA-MEM and Bowtie 2 are widely used, they are not very well studied, and this is in large part because it is difficult to separate the effect of the scoring function from the closely related effects of the heuristics. Vargas alignments could also be used to evaluate the effects of different reference genomes on alignment accuracy, such as comparing graph genomes containing different variant sets to each other and to linear references, as investigated using simulation in the FORGe study (Pritt et al, 2018).…”

Section: Discussion

mentioning

confidence: 99%

“…While most current heuristic and heuristic-free read alignment algorithms assume that the reference genome is linear, with greater understanding of genetic diversity has come increasing focus on alternatives to the linear reference genome. Various solutions have been proposed that incorporate information about genetic variation in the population, including graph-shaped reference genomes (Paten et al, 2017), pan-genomes (Yang et al, 2019), and a genome that contains the most common (major) allele at each variable site (Pritt et al, 2018; Ballouz et al, 2019). The most recent human reference genome assembly, GRCh38, includes alternate assemblies for hypervariable loci (Church et al, 2015).…”

Section: Introduction

mentioning

confidence: 99%

Vargas: heuristic-free alignment for assessing linear and graph read aligners

Darby, Gaddipati, Schatz, et al. 2019

Preprint

Self Cite


Abstract: Read alignment is central to many aspects of modern genomics. Most aligners use heuristics to accelerate processing, but these heuristics can fail to find the optimal alignments of reads. Alignment accuracy is typically measured through simulated reads; however, the simulated location may not be the (only) location with the optimal alignment score. Vargas implements a heuristic-free algorithm guaranteed to find the highest-scoring alignment for real sequencing reads to a linear or graph genome. With semiglobal and local alignment modes and affine gap and quality-scaled mismatch penalties, it can implement the scoring functions of commonly used aligners to calculate optimal alignments. While this is computationally intensive, Vargas uses multi-core parallelization and vectorized (SIMD) instructions to make it practical to optimally align large numbers of reads, achieving a maximum speed of 456 billion cell updates per second. We demonstrate how these “gold standard” Vargas alignments can be used to improve heuristic alignment accuracy by optimizing command-line parameters in Bowtie 2, BWA-MEM, and vg to align more reads correctly. Source code implemented in C++ and compiled binary releases are available at https://github.com/langmead-lab/vargas under the MIT license.


“…This idea could naturally be combined with our method by replacing the path selection step accordingly, which we plan to explore in future research. Beyond that, Pritt et al (2018) have argued that it might be beneficial to restrict the set of variants used for graph construction to a well-selected subset for two reasons: to avoid introducing unnecessary ambiguity and to simplify indexing. By providing a full-sensitivity index, we have removed the necessity for the latter, creating the opportunity for comprehensive evaluations on the trade-off between added ambiguity and reduced read mapping bias.…”

Section: Discussion

mentioning

confidence: 99%

Fully-sensitive Seed Finding in Sequence Graphs Using a Hybrid Index

Ghaffaari, Marschall 2019

Preprint


Motivation: Sequence graphs are versatile data structures that are, for instance, able to represent the genetic variation found in a population and to facilitate genome assembly. Read mapping to sequence graphs constitutes an important step for many applications and is usually done by first finding exact seed matches, which are then extended by alignment. Existing methods for finding seed hits prune the graph in complex regions, leading to a loss of information especially in highly polymorphic regions of the genome. While such complex graph structures can indeed lead to a combinatorial explosion of possible alleles, the query set of reads from a diploid individual realizes only two alleles per locus, a property that is not exploited by extant methods.

Results: We present the Pan-genome Seed Index (PSI), a fully-sensitive hybrid method for seed finding, which takes full advantage of this property by combining an index over selected paths in the graph with an index over the query reads. This enables PSI to find all seeds while eliminating the need to prune the graph. We demonstrate its performance with different parameter settings on both simulated data and on a whole human genome graph constructed from variants in the 1000 Genomes Project data set. On this graph, PSI outperforms GCSA2 in terms of index size, query time, and sensitivity.

Availability: The C++ implementation is publicly available at: https://github.com/cartoonist/psi.
