2018
DOI: 10.1101/472423
Preprint

Efficient Construction of a Complete Index for Pan-Genomics Read Alignment

Abstract: While short read aligners, which predominantly use the FM-index, are able to easily index one or a few human genomes, they do not scale well to indexing databases containing thousands of genomes. To understand why, it helps to examine the main components of the FM-index in more detail, which is a rank data structure over the Burrows-Wheeler Transform (BWT) of the string that will allow us to find the interval in the string's suffix array (SA) containing pointers to starting positions of occurrences of a given …
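To make the abstract's description concrete, here is a minimal toy sketch of how rank queries over the BWT narrow down a suffix array interval (backward search). It is not the construction proposed in the preprint: the function names `build_index` and `find` are hypothetical, the suffix array is built by naive sorting, and the full SA and full rank tables are stored, which is exactly what does not scale to thousands of genomes.

```python
def build_index(text):
    """Build a toy FM-index: suffix array, C table, and rank (occ) tables."""
    text += "$"  # sentinel, assumed absent from text and smaller than any symbol
    # Naive suffix array construction; acceptable only for tiny inputs.
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    # BWT: the character preceding each sorted suffix (wraps around for i == 0).
    bwt = "".join(text[i - 1] for i in sa)

    counts = {}
    for ch in text:
        counts[ch] = counts.get(ch, 0) + 1

    # c_table[c] = number of characters in text strictly smaller than c.
    c_table, total = {}, 0
    for ch in sorted(counts):
        c_table[ch] = total
        total += counts[ch]

    # occ[c][i] = number of occurrences of c in bwt[:i] (the rank query).
    occ = {ch: [0] * (len(bwt) + 1) for ch in counts}
    for i, b in enumerate(bwt):
        for ch in occ:
            occ[ch][i + 1] = occ[ch][i] + (1 if b == ch else 0)
    return sa, c_table, occ


def find(pattern, sa, c_table, occ):
    """Return sorted start positions of pattern, using backward search."""
    lo, hi = 0, len(sa)  # half-open SA interval [lo, hi)
    for ch in reversed(pattern):
        if ch not in c_table:
            return []
        lo = c_table[ch] + occ[ch][lo]
        hi = c_table[ch] + occ[ch][hi]
        if lo >= hi:
            return []
    return sorted(sa[lo:hi])


index = build_index("GATTACA")
print(find("TA", *index))
print(find("A", *index))
```

Each pattern character costs two rank queries, so the search time is independent of the text length; the memory taken by `sa` and `occ` is the part a practical index has to compress or sample.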

Cited by 16 publications (26 citation statements)
References 26 publications
“…This might be accomplished using unsupervised, sequence-driven clustering methods [34,35], using the "founder sequence" framework [36,37], or using some form of submodular optimization [38]. A more radical idea is to simply index all available individuals, forgoing the need to choose representatives; this is becoming more practical with the advent of new approaches for haplotype-aware path indexing [31] and efficient indexing for repetitive texts [39].…”
Section: Discussion
confidence: 99%
“…This might be accomplished using unsupervised, sequence-driven clustering methods [36,37], using the "founder sequence" framework [38,39], or using some form of submodular optimization [40]. A more radical idea is to simply index all available individuals, forgoing the need to choose representatives; this is becoming more practical with the advent of new approaches for haplotype-aware path indexing [33] and efficient indexing for repetitive texts [41].…”
Section: Discussion
confidence: 99%
“…Before that promise can be fulfilled, however, several obstacles must still be overcome: first, we need efficient algorithms to build RLBWTs and SA samples of genomic databases, which are the main components of r-indexes; second, we need an efficient way to update the r-index when we add a new genome to the database, because rebuilding it regularly will be prohibitively slow regardless of the algorithms we use; and third, as reads become longer and more likely to contain combinations of variation that we have seen before individually but not all together, we will need support for finding maximal exact matches between the read and the database. Boucher et al [14,15] and Kuhnle et al [16] have since made substantial progress on the first point, and in this paper we address the second one and give a theoretical solution to the third. As a by-product of making the r-index dynamic, we obtain an online algorithm for computing the LZ77 parse in space bounded in terms of the number of runs in the BWT.…”
Section: T T C a G A T T A A C A T T T G A T A A C A T G A T T A C A ...
confidence: 91%