2014
DOI: 10.1098/rsta.2013.0137

Hybrid indexes for repetitive datasets

Abstract: Advances in DNA sequencing mean that databases of thousands of human genomes will soon be commonplace. In this paper, we introduce a simple technique for reducing the size of conventional indexes on such highly repetitive texts. Given upper bounds on pattern lengths and edit distances, we pre-process the text with the lossless data compression algorithm LZ77 to obtain a filtered text, for which we store a conventional index. Later, given a query, we find all matches in the filtered text, then use their positio…

Cited by 32 publications (41 citation statements)
References 17 publications
“…This short overview supports our claim that industry-level multiple-genome read mappers are yet to come. There are also a number of theoretical works dedicated to indexing text with wildcard positions (Thachuk, 2013; Hon et al, 2013), where the wildcards represent SNPs, or the more general problem of indexing repetitive data with support for exact or approximate matching (Gagie et al, 2011; Jansson et al, 2014; Ferrada et al, 2014). None of them, however, can be considered a breakthrough, at least for bioinformatics, since none of them was demonstrated to run on multi-gigabyte genomic data (and in some of the cited papers no experimental results are given at all).…”
Section: Introduction (mentioning)
confidence: 99%
“…The next section reviews basic data structural tools on which hybrid indexing depends. Section 3 then gives an overview of hybrid indexing, as described by Ferrada et al [3]. We then describe our implementation of this basic scheme in Section 4.…”
Section: Our Contribution (mentioning)
confidence: 99%
“…We call the first type of occurrences primary occurrences and the remaining ones (which must necessarily be completely contained inside LZ phrases) secondary occurrences. The hybrid index [3] reports the primary and secondary occurrences of a query pattern using separate structures, which we now review (see also [7]). 3.1 Finding Primary Occurrences. For a given upper bound M on pattern length, let T_M be the string containing the characters of T within distance M of their nearest LZ phrase boundaries; characters not adjacent in T are separated in T_M by a special character # not in the alphabet of T.…”
Section: The Hybrid Index (mentioning)
confidence: 99%
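
The construction of T_M quoted above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the cited authors' implementation: the function names `lz77_phrase_starts` and `filtered_text` are hypothetical, the LZ77 parse is a naive quadratic greedy parse (a never-seen character becomes a one-character phrase), and "within distance M" of a boundary at phrase start b is taken to mean positions b-M through b+M-1.

```python
def lz77_phrase_starts(text):
    """Naive greedy LZ77-style parse (assumed variant); returns phrase start positions."""
    starts = []
    i, n = 0, len(text)
    while i < n:
        starts.append(i)
        length = 0
        # Extend the phrase while text[i:i+length+1] also occurs starting before i.
        # Limiting the search window to text[0:i+length] forces the earlier
        # occurrence to start at a position < i (self-overlap is allowed).
        while i + length < n and text.find(text[i:i + length + 1], 0, i + length) != -1:
            length += 1
        i += max(length, 1)  # a character with no earlier copy is a phrase of length 1
    return starts


def filtered_text(text, starts, M, sep="#"):
    """Keep characters within distance M of a phrase boundary (assumed convention).

    Returns the filtered string and a list mapping each of its characters back
    to a position in `text` (-1 for separator characters).
    """
    n = len(text)
    keep = [False] * n
    for b in starts:
        for i in range(max(0, b - M), min(n, b + M)):
            keep[i] = True
    out, positions, prev = [], [], None
    for i in range(n):
        if keep[i]:
            if prev is not None and i != prev + 1:
                out.append(sep)       # kept characters that are not adjacent in T
                positions.append(-1)
            out.append(text[i])
            positions.append(i)
            prev = i
    return "".join(out), positions


T = "abcdefgh" * 3 + "xyz"           # toy repetitive text
starts = lz77_phrase_starts(T)
T_M, pos = filtered_text(T, starts, M=3)
print(T_M)                           # abcdefghabc#fghxyz
```

On this toy input the 27-character text shrinks to 18 characters, because the long copied phrase in the middle contributes only the characters near its two ends.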
“…Ferrada et al [3] store a conventional pattern matching index I_M on T_M. The only assumption about I_M is that it can handle searches for pattern lengths up to M.…”
Section: The Hybrid Index (mentioning)
confidence: 99%
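
As a rough illustration of how matches found by an index on the filtered text can be translated back to positions in T, the sketch below uses a plain `str.find` scan as a stand-in for the conventional index I_M. Everything here is an assumption made for the example: the helper names, the position map `pos` (text position of each filtered character, -1 for a separator), and the hard-coded toy strings.

```python
def find_all(haystack, needle):
    """All (possibly overlapping) start offsets of needle in haystack."""
    hits = []
    k = haystack.find(needle)
    while k != -1:
        hits.append(k)
        k = haystack.find(needle, k + 1)
    return hits


def occurrences_via_filtered_text(filtered, positions, pattern, sep="#"):
    """Search the filtered text and map each match back to a position in T.

    `positions[k]` is the index in T of `filtered[k]`. A pattern that does not
    contain the separator can never match across one, so every match lies in
    a run of characters that are contiguous in T.
    """
    if sep in pattern:
        raise ValueError("pattern must not contain the separator character")
    return [positions[k] for k in find_all(filtered, pattern)]


# Hypothetical toy data: a text, its filtered text for M = 3, and the position map.
T = "abcdefgh" * 3 + "xyz"
T_M = "abcdefghabc#fghxyz"
pos = list(range(11)) + [-1] + list(range(21, 27))

print(occurrences_via_filtered_text(T_M, pos, "abc"))  # [0, 8]
print(occurrences_via_filtered_text(T_M, pos, "ghx"))  # [22]
```

The occurrence of "abc" at position 16 of T is not reported here: it lies entirely inside a copied phrase and away from its boundaries, so it is a secondary occurrence, which the quoted description says is handled by a separate structure.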