Indexing Highly Repetitive Collections

Navarro, Gonzalo

doi:10.1007/978-3-642-35926-2_29

Cited by 38 publications

(29 citation statements)

References 112 publications

“…This is the case of an increasing number of applications that deal with highly repetitive sequences: compressed software repositories, versioned document collections, DNA datasets of individuals of the same species, and so on, which contain many near-copies of the same source code, document, or genome [24]. In this scenario, statistical compressors, or a compressed WT, do not take a proper advantage of the repetitiveness [20], which is crucial to reduce the size of those usually huge datasets by orders of magnitude.…”

Section: Introduction (mentioning)

confidence: 99%

Grammar Compressed Sequences with Rank/Select Support

Navarro

Ordóñez

2014

String Processing and Information Retrieval

Self Cite

Abstract. Sequence representations supporting not only direct access to their symbols, but also rank/select operations, are a fundamental building block in many compressed data structures. In several recent applications, the need to represent highly repetitive sequences arises, where statistical compression is ineffective. We introduce grammar-based representations for repetitive sequences, which use up to 10% of the space needed by representations based on statistical compression, and support direct access and rank/select operations within tens of microseconds.
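
For readers unfamiliar with the operations named in this abstract, access, rank, and select can be stated on a plain uncompressed sequence. The sketch below is a naive linear-time illustration with hypothetical function names (`access`, `rank`, `select`); it is not the grammar-based structure of the cited paper, which answers the same queries over a compressed representation.

```python
def access(seq, i):
    """Return the symbol at 0-based position i."""
    return seq[i]


def rank(seq, c, i):
    """Count occurrences of symbol c among the first i symbols, seq[0:i]."""
    return sum(1 for s in seq[:i] if s == c)


def select(seq, c, j):
    """Return the 0-based position of the j-th occurrence of c (j >= 1).

    Returns -1 when c occurs fewer than j times.
    """
    count = 0
    for pos, s in enumerate(seq):
        if s == c:
            count += 1
            if count == j:
                return pos
    return -1


seq = "abracadabra"
third_a = select(seq, "a", 3)          # position of the third 'a'
a_up_to_there = rank(seq, "a", third_a + 1)  # number of 'a's up to and including it
```

Under these definitions the two queries are inverses of each other: `rank(seq, c, select(seq, c, j) + 1) == j` whenever the j-th occurrence exists.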

“…In the end, different days have highly similar structures (particularly if they are on the same day of the week). These similarities should be identified and exploited for indexing, as it is already done in other big data sciences [52]. Using compression, it might be possible in the future to manage all historical 4D trajectories efficiently in a main memory database, e.g., SAP HANA [53], for information retrieval.…”

Section: Discussion (mentioning)

confidence: 99%

Efficient Compression of 4D-Trajectory Data in Air Traffic Management

Wandelt

Sun

2014

IEEE Trans. Intell. Transport. Syst.

Abstract. Air traffic management (ATM) is facing a tremendous increase in the amount of available flight data, particularly four-dimensional (4D) trajectories. Computational requirements for analysis and storage of such bulk data are steeply increasing. Compression is one key technology to address this challenge. In this paper we propose two techniques for compressing air traffic 4D trajectories. Our first technique analyzes a set of samples and computes a prediction for the most likely picked successor coordinate by a random walker. The second technique, i.e., referential compression, compresses a 4D trajectory as a collection of subtrajectory pointers into a reference trajectory. We evaluate our algorithms on trajectory data from the Demand Data Repository provided by EUROCONTROL. We show that a combination of our referential and statistical compression techniques compresses 4D trajectories of all air traffic over Europe in the year 2013 from 60 GB down to 0.78 GB, achieving a compression ratio of more than 75:1. The compression ratio for our techniques increases with the number of to-be-compressed flights, whereas standard compression techniques achieve a fixed compression ratio for any number of flights. Our work contributes toward efficient handling of the increasing amount of traffic data in ATM.

Index Terms: air traffic management (ATM), compression, four-dimensional (4D) trajectories.
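
The referential idea in this abstract, encoding a sequence as pointers into a reference, can be sketched as follows. This is a minimal greedy illustration on generic Python sequences, with hypothetical names (`ref_compress`, `ref_decompress`, `min_len`) and a brute-force match search; it is not the authors' algorithm, whose trajectory-specific details are not given here.

```python
def ref_compress(reference, target, min_len=2):
    """Encode target as ("copy", start, length) pointers into reference.

    Falls back to ("lit", symbol) when no match of at least min_len exists.
    """
    out = []
    i = 0
    while i < len(target):
        best_start, best_len = -1, 0
        # Brute-force search for the longest reference match at target[i].
        for start in range(len(reference)):
            length = 0
            while (start + length < len(reference)
                   and i + length < len(target)
                   and reference[start + length] == target[i + length]):
                length += 1
            if length > best_len:
                best_start, best_len = start, length
        if best_len >= min_len:
            out.append(("copy", best_start, best_len))
            i += best_len
        else:
            out.append(("lit", target[i]))
            i += 1
    return out


def ref_decompress(reference, encoded):
    """Rebuild the target from the reference and the encoded items."""
    out = []
    for item in encoded:
        if item[0] == "copy":
            _, start, length = item
            out.extend(reference[start:start + length])
        else:
            out.append(item[1])
    return out


reference = list("ACGTACGTTT")
target = list("ACGTTTGACG")
encoded = ref_compress(reference, target)
restored = ref_decompress(reference, encoded)
```

The closer the target is to the reference, the fewer and longer the copy items become, which is the intuition behind compression ratios that improve as more similar sequences are added.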

“…While there is no consensus on how to represent a pan-genomic reference [2], many bioinformatics research projects have studied pan-genomic read alignment using a specific structure like a graph or a reference plus variations [18,10,19,6,7,14]. On the other hand, the simpler model of the pan-genome as a set of sequences has been mostly studied from a computer science perspective [15,16,8], where the focus has been on efficient indexing and exact pattern matching, but to the best of our knowledge, no off-the-shelf solution for read alignment has been provided.…”

Section: Introduction (mentioning)

confidence: 99%

CHIC: a short read aligner for pan-genomic references

Valenzuela

Mäkinen

2017

Preprint

Abstract. Recently the topic of computational pan-genomics has gained increasing attention, and particularly the problem of moving from a single-reference paradigm to a pan-genomic one. Perhaps the simplest way to represent a pan-genome is to represent it as a set of sequences. While indexing highly repetitive collections has been intensively studied in the computer science community, the research has focused on efficient indexing and exact pattern matching, making most solutions not yet suitable to be used in bioinformatic analysis pipelines. Results: We present CHIC, a short-read aligner that indexes very large and repetitive references using a hybrid technique that combines Lempel-Ziv compression with Burrows-Wheeler read aligners. Availability: Our tool is open source and available online at https://gitlab.com/dvalenzu/CHIC
