Efficient Construction of a Complete Index for Pan-Genomics Read Alignment

Kuhnle, Alan; Mun, Taher; Boucher, Christina; Gagie, Travis; Langmead, Ben; Manzini, Giovanni

doi:10.1101/472423

Cited by 16 publications

(26 citation statements)

References 26 publications


“…This might be accomplished using unsupervised, sequence-driven clustering methods [34,35], using the "founder sequence" framework [36,37], or using some form of submodular optimization [38]. A more radical idea is to simply index all available individuals, forgoing the need to choose representatives; this is becoming more practical with the advent of new approaches for haplotype-aware path indexing [31] and efficient indexing for repetitive texts [39].…”

Section: Discussion (mentioning)

confidence: 99%

Reference flow: reducing reference bias using multiple population genomes

et al. 2021

Self Cite


Most sequencing data analyses start by aligning sequencing reads to a linear reference genome, but failure to account for genetic variation leads to reference bias and confounding of results downstream. Other approaches replace the linear reference with structures like graphs that can include genetic variation, incurring major computational overhead. We propose the reference flow alignment method that uses multiple population reference genomes to improve alignment accuracy and reduce reference bias. Compared to the graph aligner vg, reference flow achieves a similar level of accuracy and bias avoidance but with 14% of the memory footprint and 5.5 times the speed.


“…This might be accomplished using unsupervised, sequence-driven clustering methods [36,37], using the "founder sequence" framework [38,39], or using some form of submodular optimization [40]. A more radical idea is to simply index all available individuals, forgoing the need to choose representatives; this is becoming more practical with the advent of new approaches for haplotype-aware path indexing [33] and efficient indexing for repetitive texts [41].…”

Section: Discussion (mentioning)

confidence: 99%

Reducing reference bias using multiple population reference genomes

Chen, Solomon, Mun, et al. 2020

Preprint

Self Cite


Most sequencing data analyses start by aligning sequencing reads to a linear reference genome. But failure to account for genetic variation causes reference bias and confounding of results downstream. Other approaches replace the linear reference with structures like graphs that can include genetic variation, incurring major computational overhead. We propose the "reference flow" alignment method that uses information from multiple population reference genomes to improve alignment accuracy and reduce reference bias. Compared to the graph aligner vg, reference flow exhibits a similar level of accuracy and bias avoidance, but with 13% of the memory footprint and 6 times the speed.


“…Before that promise can be fulfilled, however, several obstacles must still be overcome: first, we need efficient algorithms to build RLBWTs and SA samples of genomic databases, which are the main components of r-indexes; second, we need an efficient way to update the r-index when we add a new genome to the database, because rebuilding it regularly will be prohibitively slow regardless of the algorithms we use; and third, as reads become longer and more likely to contain combinations of variation that we have seen before individually but not all together, we will need support for finding maximal exact matches between the read and the database. Boucher et al [14,15] and Kuhnle et al [16] have since made substantial progress on the first point, and in this paper we address the second one and give a theoretical solution to the third. As a by-product of making the r-index dynamic, we obtain an online algorithm for computing the LZ77 parse in space bounded in terms of the number of runs in the BWT.…”

Section: T T C a G A T T A A C A T T T G A T A A C A T G A T T A C A ... (mentioning)

confidence: 91%