Efficient Construction of a Complete Index for Pan-Genomics Read Alignment

Kuhnle, Alan; Mun, Taher; Boucher, Christina; Gagie, Travis; Langmead, Ben; Manzini, Giovanni

doi:10.1007/978-3-030-17083-7_10

Cited by 11 publications

(8 citation statements)

References 36 publications


“…• bigbwt: Use a so-called prefix-free parsing technique, which is shown to be useful to reduce the working space and at the same time accelerate BWT construction [14,16].…”

Section: Results

mentioning

confidence: 99%
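To make the idea of prefix-free parsing in the statement above concrete, here is a minimal sketch, not the bigbwt implementation. It cuts the text wherever a hash of a length-w window is 0 modulo p, lets consecutive phrases overlap by w characters, and stores the text as a dictionary of distinct phrases plus a parse (the sequence of phrase ranks). The function names, the sentinel characters, the polynomial hash, and the parameter values are all assumptions chosen for illustration.

```python
def window_hash(s: str, base: int = 256, mod: int = 1_000_000_007) -> int:
    """Plain polynomial hash of a short window (recomputed, not rolling)."""
    h = 0
    for ch in s:
        h = (h * base + ord(ch)) % mod
    return h


def prefix_free_parse(text: str, w: int = 2, p: int = 3):
    """Split text into overlapping phrases at trigger windows.

    Assumption: '\x01' and '\x00' do not occur in text; they serve as the
    start sentinel and the w end sentinels.
    """
    t = "\x01" + text + "\x00" * w
    phrases = []
    start = 0
    for i in range(1, len(t) - w + 1):
        is_last = i + w == len(t)
        # A window is a trigger if its hash is 0 mod p; the final window
        # always closes the last phrase.
        if is_last or window_hash(t[i:i + w]) % p == 0:
            phrases.append(t[start:i + w])
            start = i  # the next phrase begins with this trigger window
    dictionary = sorted(set(phrases))
    rank = {ph: k for k, ph in enumerate(dictionary)}
    parse = [rank[ph] for ph in phrases]
    return dictionary, parse


def rebuild(dictionary, parse, w: int) -> str:
    """Invert the parse: drop the w overlapping characters of each later phrase."""
    phrases = [dictionary[k] for k in parse]
    return phrases[0] + "".join(ph[w:] for ph in phrases[1:])
```

On repetitive input the same phrases recur, so the dictionary stays small relative to the text, which is the property that BWT construction from the dictionary and parse relies on.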

“…Before that promise can be fulfilled, however, several obstacles must still be overcome: first, we need efficient algorithms to build RLBWTs and SA samples of genomic databases, which are the main components of r-indexes; second, we need an efficient way to update the r-index when we add a new genome to the database, because rebuilding it regularly will be prohibitively slow regardless of the algorithms we use; and third, as reads become longer and more likely to contain combinations of variation that we have seen before individually but not all together, we will need support for finding maximal exact matches between the read and the database. Boucher et al [14,15] and Kuhnle et al [16] have since made substantial progress on the first point, and in this paper we address the second one and give a theoretical solution to the third. As a by-product of making the r-index dynamic, we obtain an online algorithm for computing the LZ77 parse in space bounded in terms of the number of runs in the BWT.…”

mentioning

confidence: 91%

“…In Section 2 we review some previous results that we will use throughout this paper, and strengthen Policriti and Prezza's Toehold Lemma to require SA entries only at the beginnings of the runs in the BWT (which significantly improves the practical performance of the r-index [16]) and simplify its proof. In Section 3 we show how to update the r-index efficiently when adding a new genome to the database, and in Section 4 we show how that can be applied to compute the LZ77 parse online from a growing r-index.…”

mentioning

confidence: 93%
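For readers unfamiliar with the LZ77 parse mentioned in the two statements above, the following is a naive quadratic-time sketch of one common variant: each phrase is either a single new character or the longest match starting earlier in the text (overlap with the current position allowed). It is not the online, r-index-based algorithm the citing paper describes; the function names and the phrase encoding are assumptions made for illustration.

```python
def lz77_parse(text: str):
    """Greedy LZ77 factorisation by brute-force search.

    Returns a list whose items are either a literal character or a
    (source_position, length) pair referring to earlier text.
    """
    phrases = []
    i, n = 0, len(text)
    while i < n:
        best_len, best_src = 0, -1
        for j in range(i):  # candidate start of an earlier occurrence
            k = 0
            while i + k < n and text[j + k] == text[i + k]:
                k += 1
            if k > best_len:
                best_len, best_src = k, j
        if best_len == 0:
            phrases.append(text[i])  # character not seen before
            i += 1
        else:
            phrases.append((best_src, best_len))
            i += best_len
    return phrases


def lz77_decode(phrases) -> str:
    """Rebuild the text; copying one character at a time handles overlaps."""
    out = []
    for ph in phrases:
        if isinstance(ph, str):
            out.append(ph)
        else:
            src, length = ph
            for k in range(length):
                out.append(out[src + k])
    return "".join(out)
```

For example, `lz77_parse("abababa")` yields two literals followed by one self-overlapping copy, `['a', 'b', (0, 5)]`.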


Refining the r-index

Bannai, Gagie, Tomohiro

2020, Theoretical Computer Science

Self Cite


“…Building on previous authors' work [11], Gagie, Navarro and Prezza [4] described how a fully functional variant of the FM-index for such a database could be stored in reasonable space: their variant takes O(r) machine words, where r is the number of runs in the BWT of the database, and thus is called the r-index. Prezza [14] gave a preliminary implementation, which was significantly extended by Boucher et al [1] and Kuhnle et al [6]. This paper is meant as a brief guide to the extended implementation.…”

Section: Introduction

mentioning

confidence: 99%

Matching Reads to Many Genomes with the r-Index

Mun, Kuhnle, Boucher, et al.

2020, Journal of Computational Biology

Self Cite

The r-index is a tool for compressed indexing of genomic databases for exact pattern matching, which can be used to completely align reads that perfectly match some part of a genome in the database or to find seeds for reads that do not. This paper shows how to download and install the programs ri-buildfasta and ri-align; how to call ri-buildfasta on a FASTA file to build an r-index for that file; and how to query that index with ri-align. Availability: The source code for these programs is released under GPLv3 and available at https://github.com/alshai/r-index.
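The quantity r in the citation statement above, the number of runs in the BWT of the database, can be illustrated with a small sketch. The code below builds the BWT by sorting all rotations, which is only practical for toy inputs and is not how ri-buildfasta works, and then counts maximal runs of equal characters. The function names and the `$` sentinel are assumptions made for illustration.

```python
def bwt(text: str, sentinel: str = "$") -> str:
    """Burrows-Wheeler transform via sorted rotations (toy inputs only).

    Assumption: the sentinel does not occur in text and sorts before
    every character of text.
    """
    assert sentinel not in text
    t = text + sentinel
    rotations = sorted(t[i:] + t[:i] for i in range(len(t)))
    return "".join(rot[-1] for rot in rotations)


def count_runs(s: str) -> int:
    """Number of maximal runs of equal characters, i.e. r when s is a BWT."""
    return sum(1 for i, ch in enumerate(s) if i == 0 or ch != s[i - 1])


if __name__ == "__main__":
    b = bwt("banana")
    print(b, count_runs(b))  # annb$aa 5
```

Highly repetitive inputs, such as collections of similar genomes, tend to produce long runs in the BWT, which is why an index of size O(r) words can be much smaller than the text.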


“…There is a theoretical proposal for supporting fast locate() queries in space proportional to the size of the run-length encoded BWT (Gagie et al, 2018). While there has been some progress in building the proposed index for large datasets (Kuhnle et al, 2019), scaling it up to TOPMed scale is still an open problem.…”

mentioning

confidence: 99%

Haplotype-aware graph indexes

Sirén, Garrison, Novak, et al.

2019, Preprint

Motivation: The variation graph toolkit (VG) represents genetic variation as a graph. Although each path in the graph is a potential haplotype, most paths are nonbiological, unlikely recombinations of true haplotypes. Results: We augment the VG model with haplotype information to identify which paths are more likely to exist in nature. For this purpose, we develop a scalable implementation of the graph extension of the positional Burrows-Wheeler transform (GBWT). We demonstrate the scalability of the new implementation by building a whole-genome index of the 5,008 haplotypes of the 1000 Genomes Project, and an index of all 108,070 TOPMed Freeze 5 chromosome 17 haplotypes. We also develop an algorithm for simplifying variation graphs for k-mer indexing without losing any k-mers in the haplotypes.

