Prefix-free parsing for building big BWTs

Boucher, Christina; Gagie, Travis; Kuhnle, Alan; Langmead, Ben; Manzini, Giovanni; Mun, Taher

doi:10.1186/s13015-019-0148-5

Cited by 58 publications

(89 citation statements)

References 18 publications


“…Here, we describe our algorithm for building the SA or the sampled SA from the prefix-free parse of an input string S, which is used to build the r-index. We first review the algorithm from [2] for building the BWT of S from the prefix-free parse. Next, we show how to modify this construction to compute the SA or the sampled SA along with the BWT.…”

Section: Methods (mentioning)

confidence: 99%

“…It takes as input string S, and in one-pass generates a dictionary and a parse of S with the property that the BWT can be constructed from dictionary and parse using workspace proportional to their total size and O(|S|) time. Yet, the resulting index of Boucher et al [2] has no SA sample, and therefore, only supports counting and not locating. This makes this index not directly applicable to many bioinformatic applications, such as sequence alignment.…”

Section: Introduction (mentioning)

confidence: 91%
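The one-pass dictionary-and-parse construction described in the statement above can be illustrated with a minimal sketch. This is not the authors' implementation: the window length `w`, the modulus `p`, the sentinel characters `\x01` and `\x02`, and the plain polynomial hash are all assumptions chosen for readability (a practical tool would more likely use a rolling hash so that each window costs constant time).

```python
def window_hash(window: str) -> int:
    """Deterministic polynomial hash of a short window (illustrative only)."""
    h = 0
    for ch in window:
        h = (h * 256 + ord(ch)) % 1_000_000_007
    return h


def prefix_free_parse(text: str, w: int = 2, p: int = 3):
    """Split text into phrases that end at 'trigger' windows.

    Consecutive phrases overlap by exactly w characters.
    Assumes w >= 1 and that text contains neither sentinel character.
    """
    s = "\x01" + text + "\x02" * w  # hypothetical start and end sentinels
    phrases = []
    start = 0
    for i in range(1, len(s) - w + 1):
        window = s[i:i + w]
        is_last = i + w == len(s)
        # A phrase ends when the window hash is 0 mod p, or at the end of s.
        if is_last or window_hash(window) % p == 0:
            phrases.append(s[start:i + w])
            start = i  # the next phrase begins with this same window
    dictionary = sorted(set(phrases))
    rank = {phrase: k for k, phrase in enumerate(dictionary)}
    parse = [rank[phrase] for phrase in phrases]
    return dictionary, parse


def reconstruct(dictionary, parse, w: int) -> str:
    """Invert the parse: glue phrases back together, dropping the overlaps."""
    out = dictionary[parse[0]]
    for k in parse[1:]:
        out += dictionary[k][w:]
    return out[1:-w]  # strip the sentinels


if __name__ == "__main__":
    d, ps = prefix_free_parse("GATTACAGATTACAGATTACA")
    print(len(d), "distinct phrases;", len(ps), "phrases in the parse")
```

On repetitive input the same phrases recur, so the dictionary stays small relative to the text while the parse is a short list of integers; that is the property the quoted statement relies on when it bounds the workspace by the total size of the dictionary and parse.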

“…While their result yields a potentially practical FM-index on massive databases, it does not directly lead to a solution since the problem of how to efficiently construct the BWT and SA sample remained open. As a step toward fully realizing the theoretical result of Gagie et al [11], Boucher et al [2] showed how to build the BWT of large genomic databases efficiently. We refer to this construction as prefix-free parsing.…”

Section: Introduction (mentioning)

confidence: 99%


Efficient Construction of a Complete Index for Pan-Genomics Read Alignment

Kuhnle, Mun, Boucher, et al. 2019

Lecture Notes in Computer Science

While short read aligners, which predominantly use the FM-index, are able to easily index one or a few human genomes, they do not scale well to indexing databases containing thousands of genomes. To understand why, it helps to examine the main components of the FM-index in more detail: first, a rank data structure over the Burrows-Wheeler Transform (BWT) of the string, which allows us to find the interval in the string's suffix array (SA) containing pointers to the starting positions of occurrences of a given pattern; second, a sample of the SA that, when used with the rank data structure, allows us to access the SA. The rank data structure can be kept small even for large genomic databases by run-length compressing the BWT, but until recently there was no known way to keep the SA sample small without greatly slowing down access to the SA. Now that Gagie et al. (SODA 2018) have defined an SA sample that takes about the same space as the run-length compressed BWT, we have the design for efficient FM-indexes of genomic databases but are faced with the problem of building them. In 2018 we showed how to build the BWT of large genomic databases efficiently (WABI 2018), but the problem of building Gagie et al.'s SA sample efficiently was left open. We compare our approach to state-of-the-art methods for constructing the SA sample, and demonstrate that it is the fastest and most space-efficient method on highly repetitive genomic databases. Lastly, we apply our method to indexing partial and whole human genomes, and show that it improves over Bowtie with respect to both memory and time.

Availability: The implementation of our methods can be found at https://github.com/alshai/r-index.
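The two FM-index components named in the abstract (rank queries over the BWT, and the SA used to report positions) can be sketched as follows. This is a toy illustration under stated assumptions, not the r-index code: the SA is built by naively sorting suffixes, `rank` is a linear scan rather than a succinct structure, the full SA is kept instead of a sample, and `$` is assumed to be absent from the text and smaller than every other character.

```python
def build_sa_and_bwt(text: str):
    """Naive suffix array and BWT of text + '$' (quadratic; for illustration)."""
    s = text + "$"
    sa = sorted(range(len(s)), key=lambda i: s[i:])
    bwt = "".join(s[i - 1] for i in sa)  # s[-1] is '$' when i == 0
    return sa, bwt


def backward_search(bwt: str, pattern: str):
    """Return the half-open SA interval [lo, hi) of suffixes prefixed by pattern."""
    # C[c] = number of characters in the BWT strictly smaller than c.
    counts = {}
    for ch in bwt:
        counts[ch] = counts.get(ch, 0) + 1
    C, total = {}, 0
    for ch in sorted(counts):
        C[ch] = total
        total += counts[ch]

    def rank(c: str, i: int) -> int:
        """Occurrences of c in bwt[:i]; a real index answers this in O(1) or O(log)."""
        return bwt[:i].count(c)

    lo, hi = 0, len(bwt)
    for c in reversed(pattern):  # process the pattern right to left
        if c not in C:
            return 0, 0
        lo = C[c] + rank(c, lo)
        hi = C[c] + rank(c, hi)
        if lo >= hi:
            return 0, 0
    return lo, hi


if __name__ == "__main__":
    sa, bwt = build_sa_and_bwt("GATTACAGATTACA")
    lo, hi = backward_search(bwt, "TTA")
    print("count:", hi - lo, "positions:", sorted(sa[lo:hi]))
```

Counting needs only the BWT and `rank`; reporting positions needs `sa[lo:hi]`. That second step is why an index without an SA sample supports counting but not locating.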


“…Users should first download some prerequisite packages, and the source code from the github repos- These commands will install the binaries ri-buildfasta and ri-align in the system's default bin location (e.g., /usr/local/bin for Ubuntu users), together with bigbwt [1] and the SDSL library [5] (if it is not already present). If users want the binaries elsewhere, then they should use $ cmake -DCMAKE_INSTALL_PREFIX=<dest> ..…”

Section: Installation (mentioning)

confidence: 99%

“…Building on previous authors' work [11], Gagie, Navarro and Prezza [4] described how a fully functional variant of the FM-index for such a database could be stored in reasonable space: their variant takes O(r) machine words, where r is the number of runs in the BWT of the database, and thus is called the r-index. Prezza [14] gave a preliminary implementation, which was significantly extended by Boucher et al [1] and Kuhnle et al [6]. This paper is meant as a brief guide to the extended implementation.…”

Section: Introduction (mentioning)

confidence: 99%
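To make the quantity r from the statement above concrete, here is a small sketch that computes the BWT of a highly repetitive string and counts its runs. The naive suffix sort and the `$` terminator (assumed absent from the text and smaller than every other character) are illustrative choices, not part of the r-index implementation.

```python
from itertools import groupby


def bwt(text: str) -> str:
    """BWT of text + '$' via naive suffix sorting (for illustration only)."""
    s = text + "$"
    order = sorted(range(len(s)), key=lambda i: s[i:])
    return "".join(s[i - 1] for i in order)


def run_length_encode(b: str):
    """Collapse maximal runs of equal characters into (char, length) pairs."""
    return [(ch, len(list(group))) for ch, group in groupby(b)]


if __name__ == "__main__":
    text = "ACGT" * 50
    runs = run_length_encode(bwt(text))
    print("n =", len(text) + 1, "r =", len(runs))
    print(runs)
```

For this input the 201-character BWT collapses to five runs, which is the sense in which an O(r)-word index can be far smaller than the text it indexes when the database is repetitive.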

Matching Reads to Many Genomes with the r-Index

Mun, Kuhnle, Boucher, et al. 2020

Journal of Computational Biology

The r-index is a tool for compressed indexing of genomic databases for exact pattern matching, which can be used to completely align reads that perfectly match some part of a genome in the database or to find seeds for reads that do not. This paper shows how to download and install the programs ri-buildfasta and ri-align; how to call ri-buildfasta on a FASTA file to build an r-index for that file; and how to query that index with ri-align.

Availability: The source code for these programs is released under GPLv3 and available at https://github.com/alshai/r-index.


Rpair: Rescaling RePair with Rsync

Gagie, Tomohiro, Manzini, et al. 2019

Lecture Notes in Computer Science

Data compression is a powerful tool for managing massive but repetitive datasets, especially schemes such as grammar-based compression that support computation over the data without decompressing it. In the best case such a scheme takes a dataset so big that it must be stored on disk and shrinks it enough that it can be stored and processed in internal memory. Even then, however, the scheme is essentially useless unless it can be built on the original dataset reasonably quickly while keeping the dataset on disk. In this paper we show how we can preprocess such datasets with context-triggered piecewise hashing such that afterwards we can apply RePair and other grammar-based compressors more easily. We first give our algorithm, then show how a variant of it can be used to approximate the LZ77 parse, then leverage that to prove theoretical bounds on compression, and finally give experimental evidence that our approach is competitive in practice.
