Hybrid indexes for repetitive datasets

Ferrada, Héctor; Gagie, Travis; Hirvola, Tommi; Puglisi, Simon J.

doi:10.1098/rsta.2013.0137

“…This short overview supports our claim that industry-level multiple-genome read mappers are yet to come. There are also a number of theoretical works dedicated to indexing text with wildcard positions (Thachuk, 2013; Hon et al., 2013), where the wildcards represent SNPs, or the more general problem of indexing repetitive data with support for exact or approximate matching (Gagie et al., 2011; Jansson et al., 2014; Ferrada et al., 2014). None of them, however, can be considered a breakthrough, at least for bioinformatics, since none of them was demonstrated to run on multi-gigabyte genomic data (and in some of the cited papers no experimental results are given at all).…”

Section: Introduction (mentioning)

confidence: 99%

Whisper: Read sorting allows robust mapping of sequencing data

Deorowicz, Debudaj-Grabysz, Gudyś et al. 2017

Preprint

Motivation: Mapping reads to a reference genome is often the first step in a sequencing data analysis pipeline. Mistakes made at this computationally challenging stage cannot be recovered easily. Results: We present Whisper, an accurate and high-performant mapping tool, based on the idea of sorting reads and then mapping them against suffix arrays for the reference genome and its reverse complement. Employing task and data parallelism as well as storing temporary data on disk result in superior time efficiency at reasonable memory requirements. Whisper excels at large NGS read collections, in particular Illumina reads with typical WGS coverage. The experiments with real data indicate that our solution works in about 15% of the time needed by the well-known Bowtie2 and BWA-MEM tools at a comparable accuracy (validated in variant calling pipeline).
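The abstract mentions mapping reads against suffix arrays of the reference. As a rough illustration of exact matching with a suffix array (not Whisper's actual algorithm, which also sorts reads and handles mismatches), the following minimal sketch builds a suffix array naively and binary-searches it. The function names `build_suffix_array` and `find_occurrences` are hypothetical and chosen only for this example.

```python
def build_suffix_array(text):
    """Naive construction: sort suffix start positions by the suffix they begin.
    Fine for small examples; real tools use linear-time or disk-based builders."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrences(text, sa, pattern):
    """Return all start positions of pattern in text, in increasing order."""
    m = len(pattern)

    # Lower bound: first suffix whose length-m prefix is >= pattern.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    first = lo

    # Upper bound: first suffix whose length-m prefix is > pattern.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid

    return sorted(sa[first:lo])


reference = "ACGTACGTTACG"
sa = build_suffix_array(reference)
print(find_occurrences(reference, sa, "ACG"))
```

All suffixes that start with the pattern are contiguous in the suffix array, which is why two binary searches are enough to delimit them.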

“…The next section reviews basic data structural tools on which hybrid indexing depends. Section 3 then gives an overview of hybrid indexing, as described by Ferrada et al [3]. We then describe our implementation of this basic scheme in Section 4.…”

Section: Our Contribution (mentioning)

confidence: 99%

“…We call the first type of occurrences primary occurrences and the remaining ones (which must necessarily be completely contained inside LZ phrases) secondary occurrences. The hybrid index [3] reports the primary and secondary occurrences of a query pattern using separate structures, which we now review (see also [7]). 3.1 Finding Primary Occurrences. For a given upper bound M on pattern length, let T_M be the string containing the characters of T within distance M of their nearest LZ phrase boundaries; characters not adjacent in T are separated in T_M by a special character # not in the alphabet of T.…”

Section: The Hybrid Index (mentioning)

confidence: 99%
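The definition of the filtered string quoted above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: phrase boundaries are given as a list of phrase start positions, a boundary at position b is taken to lie between characters b-1 and b, and the name `filtered_text` is hypothetical.

```python
def filtered_text(text, phrase_starts, M, sep="#"):
    """Keep only characters within distance M of a phrase boundary.

    Returns (filtered, intervals): the filtered string, and the half-open
    intervals [lo, hi) of text that were kept, in order. Runs of kept
    characters that are not adjacent in text are joined with sep, which is
    assumed not to occur in text.
    """
    n = len(text)
    intervals = []
    for b in sorted(phrase_starts):
        lo, hi = max(0, b - M), min(n, b + M)
        if intervals and lo <= intervals[-1][1]:
            # Overlaps or touches the previous window: extend it.
            intervals[-1][1] = max(intervals[-1][1], hi)
        else:
            intervals.append([lo, hi])
    pieces = [text[lo:hi] for lo, hi in intervals]
    return sep.join(pieces), intervals


text = "abcdefghijklmnopqrstuvwxyz"
t_m, kept = filtered_text(text, phrase_starts=[0, 10, 20], M=2)
print(t_m)   # ab#ijkl#stuv
print(kept)  # [[0, 2], [8, 12], [18, 22]]
```

The kept intervals are what allow a match found in the filtered string to be mapped back to a position in the original text. When M is large relative to the phrase lengths, neighbouring windows merge and the filtered string approaches the full text.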

“…Recently, in an attempt to address this standoff, Ferrada et al. [3] described hybrid indexing - an algorithmic technique by which any conventional pattern matching index (including any read aligner) can be made to scale to large, highly compressible collections via means of the Lempel-Ziv (LZ77) parsing [30,14,11], a method from data compression (we give a formal definition shortly). In particular, given an upper bound M on the searchable pattern length, the first step of hybrid indexing is to obtain a filtered string consisting of the concatenation of the M-length substrings to the left and right of each LZ77 phrase boundary.…”

Section: Introduction (mentioning)

confidence: 99%
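The quoted passage defers the formal definition of the LZ77 parsing. As a rough sketch, assuming one common variant in which each phrase is either a single character not seen before or the longest prefix of the remaining text that also starts at an earlier position (overlap with the phrase itself allowed), the phrase boundaries can be computed naively as follows. The function name `lz77_phrase_starts` is hypothetical, and the quadratic scan is for illustration only.

```python
def lz77_phrase_starts(text):
    """Greedy LZ77-style parse; returns the start position of every phrase.

    Assumed variant: a phrase is the longest prefix of text[i:] that also
    occurs starting at some earlier position j < i, or a single new
    character if no such prefix exists.
    """
    starts = []
    i, n = 0, len(text)
    while i < n:
        starts.append(i)
        longest = 0
        for j in range(i):  # try every earlier start as the copy source
            length = 0
            while i + length < n and text[j + length] == text[i + length]:
                length += 1
            longest = max(longest, length)
        i += max(longest, 1)  # a new character forms a phrase of length 1
    return starts


print(lz77_phrase_starts("abababab"))   # [0, 1, 2]
print(lz77_phrase_starts("abcabcabd"))  # [0, 1, 2, 3, 8]
```

On repetitive input the number of phrases z stays small even as the text grows, which is the property hybrid indexing relies on.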

2018 Proceedings of the Twentieth Workshop on Algorithm Engineering and Experiments (ALENEX)

Pagh, Venkatasubramanian 2018

Hybrid indexing is a recent approach to text indexing that allows the space usage of conventional text indexes (e.g., suffix trees, suffix arrays, FM-indexes) to scale well with the text size, n, when z, the size of the Lempel-Ziv parsing of the text, is small relative to n. The price for this improved scalability is that an upper bound M on the pattern length that can be searched for must be declared at index construction time. Because the size of the resulting index contains an O(Mz) term, M must be kept reasonably small, though it has been shown that M ≈ 100 leads to acceptable performance in some genomic applications. However, despite its promise, the practical performance of hybrid indexing relative to other compressed index data structures is poorly understood. This paper addresses that need, detailing experiments that show hybrid indexing - when carefully implemented - to be significantly smaller and faster than alternative approaches on a broad range of data of different levels of compressibility. We also describe practical extensions to hybrid indexing that obviate the restriction on M, supporting search for patterns of arbitrary length.

“…Ferrada et al. [3] store a conventional pattern matching index I_M on T_M. The only assumption about I_M is that it can handle searches for pattern lengths up to M.…”

Section: The Hybrid Index (mentioning)

confidence: 99%

Hybrid Indexing Revisited

Ferrada, Kempa, Puglisi 2018

2018 Proceedings of the Twentieth Workshop on Algorithm Engineering and Experiments (ALENEX)

Hybrid indexes for repetitive datasets

Cited by 32 publications

References 17 publications
