2012
DOI: 10.1007/978-3-642-35926-2_29

Indexing Highly Repetitive Collections

Abstract: The need to index and search huge highly repetitive sequence collections is rapidly arising in various fields, including computational biology, software repositories, versioned collections, and others. In this short survey we briefly describe the progress made along three research lines to address the problem: compressed suffix arrays, grammar compressed indexes, and Lempel-Ziv compressed indexes.

Cited by 38 publications (29 citation statements)
References 112 publications
“…This is the case of an increasing number of applications that deal with highly repetitive sequences: compressed software repositories, versioned document collections, DNA datasets of individuals of the same species, and so on, which contain many near-copies of the same source code, document, or genome [24]. In this scenario, statistical compressors, or a compressed WT, do not take a proper advantage of the repetitiveness [20], which is crucial to reduce the size of those usually huge datasets by orders of magnitude.…”
Section: Introduction (mentioning)
confidence: 99%
“…In the end, different days have highly similar structures (particularly if they are on the same day of the week). These similarities should be identified and exploited for indexing, as it is already done in other big data sciences [52]. Using compression, it might be possible in the future to manage all historical 4D trajectories efficiently in a main memory database, e.g., SAP HANA [53], for information retrieval.…”
Section: Discussion (mentioning)
confidence: 99%
“…While there is no consensus on how to represent a pan-genomic reference [2], many bioinformatics research projects have studied pan-genomic read alignment using a specific structure like a graph or a reference plus variations [18,10,19,6,7,14]. On the other hand, the simpler model of the pan-genome as a set of sequences has been mostly studied from a computer science perspective [15,16,8], where the focus has been on efficient indexing and exact pattern matching, but to the best of our knowledge, no off-the-shelf solution for read alignment has been provided.…”
Section: Introduction (mentioning)
confidence: 99%