We address the problem of designing, implementing and experimenting with compressed data structures that support \(\textsf{rank}\) and \(\textsf{select}\) queries over a dictionary of integers. We shine a new light on this classical problem by showing a connection between the input integers and the geometry of a set of points in a Cartesian plane suitably derived from them. We then build upon some results in computational geometry to introduce the first compressed rank/select dictionary based on the idea of “learning” the distribution of such points via proper linear approximations (LA). We therefore call this novel data structure the la_vector. We prove time and space complexities of the la_vector in several scenarios: in the worst case, in the case of input distributions with finite mean and variance, and taking into account the \(k\)-th order entropy of some of its building blocks. We also discuss improved hybrid data structures, namely ones that suitably orchestrate known compressed rank/select dictionaries with the la_vector. We corroborate our theoretical results with a large set of experiments over datasets originating from a variety of applications (Web search, DNA sequencing, information retrieval and natural language processing) and show that our approach provides new interesting space-time trade-offs with respect to many well-established compressed rank/select dictionary implementations. In particular, we show that our select is the fastest, and our rank is on the space-time Pareto frontier.
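To make the geometric idea concrete, here is a minimal Python sketch of a rank/select dictionary driven by linear approximations. It is an illustration under simplifying assumptions, not the authors' implementation: the names (build_segments, select, EPS) are ours, the greedy chord-based segmentation stands in for the optimal piecewise linear approximation computed by the actual la_vector, and the corrections are kept in a plain list rather than packed into c-bit fields.

```python
# A minimal sketch, assuming sorted integers xs and an illustrative error
# bound EPS; build_segments and select are hypothetical names, not the
# authors' API. The greedy chord fitting below stands in for the optimal
# piecewise linear approximation of the real la_vector.
import bisect

EPS = 15  # maximum allowed |x_i - approximation(i)|

def build_segments(xs, eps=EPS):
    """View xs as points (i, x_i); greedily cut them into ranges, each
    approximated by the chord through its endpoints so that every point
    lies within eps of the line. Store one small correction per point."""
    segs, corr = [], [0] * len(xs)  # segs: (first_index, slope, intercept)
    i, n = 0, len(xs)
    while i < n:
        j = i + 1
        while j < n:
            slope = (xs[j] - xs[i]) / (j - i)
            if all(abs(xs[k] - (xs[i] + slope * (k - i))) <= eps
                   for k in range(i, j + 1)):
                j += 1
            else:
                break
        j -= 1  # last position the chord still covers within eps
        slope = (xs[j] - xs[i]) / (j - i) if j > i else 0.0
        segs.append((i, slope, xs[i]))
        for k in range(i, j + 1):
            corr[k] = xs[k] - round(xs[i] + slope * (k - i))
        i = j + 1
    return segs, corr

def select(segs, corr, i):
    """Return the (i+1)-th smallest integer: evaluate the right segment's
    line at position i, then repair the result with the stored correction."""
    s = bisect.bisect_right([first for first, _, _ in segs], i) - 1
    first, slope, intercept = segs[s]
    return round(intercept + slope * (i - first)) + corr[i]
```

On top of the same segments, a rank(x) query can proceed in the opposite direction: locate the segment whose range of values contains x, invert its line to predict a position, and resolve x within the small window that the error bound eps allows.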
We present the first learned index that supports predecessor, range queries and updates within provably efficient time and space bounds in the worst case. In the (static) context of just predecessor and range queries these bounds turn out to be optimal. We call this learned index the Piecewise Geometric Model index (PGM-index). Its flexible design allows us to introduce three variants which are novel in the context of learned data structures. The first variant of the PGM-index is able to adapt itself to the distribution of the query operations, thus resulting in the first known distribution-aware learned index to date. The second variant exploits the repetitiveness possibly present at the level of the learned models that compose the PGM-index to further compress its succinct space footprint. The third one is a multicriteria variant of the PGM-index that efficiently auto-tunes itself in a few seconds over hundreds of millions of keys to satisfy space-time constraints which evolve over time across users, devices and applications. These theoretical achievements are supported by a large set of experimental results on known datasets which show that the fully-dynamic PGM-index improves the space occupancy of existing traditional and learned indexes by up to three orders of magnitude, while still achieving their same or even better query and update time efficiency. As an example, in the static setting of predecessor and range queries, the PGM-index matches the query performance of a cache-optimised static B+-tree within two orders of magnitude (83×) less space; whereas in the fully-dynamic setting, where insertions and deletions are allowed, the PGM-index improves the query and update time performance of a B+-tree by up to 71% within three orders of magnitude (1140×) less space.
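To illustrate the mechanism behind these bounds, the sketch below implements a single-level, PGM-like predecessor search in Python over distinct sorted keys: a piecewise linear model predicts the rank of the query key within a fixed error EPS, so a lookup costs one segment selection plus a binary search confined to a window of O(EPS) positions. All names are illustrative rather than the authors' C++ API, and the greedy segmentation is a stand-in for the optimal streaming construction of the real PGM-index, which moreover recurses on the segments' first keys to form a multi-level structure.

```python
# A single-level sketch under simplifying assumptions (distinct sorted
# integer keys, greedy segmentation, hypothetical names build_level and
# predecessor); not the authors' implementation.
import bisect

EPS = 32  # maximum error, in positions, of every linear model

def build_level(keys, eps=EPS):
    """One piecewise linear level: each segment (first_key, slope,
    first_rank) predicts rank(key) within +/- eps for the keys it covers."""
    segs, i, n = [], 0, len(keys)
    while i < n:
        j = i + 1
        while j < n:
            slope = (j - i) / (keys[j] - keys[i])
            if all(abs(k - (i + slope * (keys[k] - keys[i]))) <= eps
                   for k in range(i, j + 1)):
                j += 1
            else:
                break
        j -= 1  # last key the current line still predicts within eps
        slope = (j - i) / (keys[j] - keys[i]) if j > i else 0.0
        segs.append((keys[i], slope, i))
        i = j + 1
    return segs

def predecessor(keys, segs, q, eps=EPS):
    """Largest key <= q (or None): one prediction, then a binary search
    restricted to a window of O(eps) positions around it."""
    s = bisect.bisect_right([fk for fk, _, _ in segs], q) - 1
    if s < 0:
        return None  # q precedes every key
    first_key, slope, a = segs[s]
    b = segs[s + 1][2] if s + 1 < len(segs) else len(keys)
    pos = min(max(int(a + slope * (q - first_key)), a), b - 1)
    lo, hi = max(a, pos - eps - 1), min(b, pos + eps + 2)  # widened by 1
    idx = bisect.bisect_right(keys, q, lo, hi) - 1
    return keys[idx] if idx >= 0 else None
```

A range query then follows immediately: locate the predecessor of the left endpoint and scan the keys forward until the right endpoint is exceeded.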
We address the well-known problem of designing, implementing and experimenting with compressed data structures for supporting rank and select queries over a dictionary of integers. This problem has been studied far and wide since the end of the '80s, producing a wealth of important theoretical and practical results. Following a recent line of research on the so-called learned data structures, we first show that this problem has a surprising connection with the geometry of a set of points in the Cartesian plane suitably derived from the input integers. We then build upon some classical results in computational geometry to introduce the first "learned" scheme for implementing a compressed rank/select dictionary. We prove theoretical bounds on its time and space performance both in the worst case and in the case of input distributions with finite mean and variance. We corroborate these theoretical results with a large set of experiments over datasets originating from a variety of sources and applications (Web, DNA sequencing, information retrieval and natural language processing), and we show that a carefully engineered version of our approach provides new interesting space-time trade-offs with respect to several well-established implementations of Elias-Fano, RRRvector, and random-access vectors of Elias γ/δ-coded gaps.
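For context on the last comparison, the sketch below spells out the simplest of those baselines: gap compression with Elias γ codes. The γ code itself is standard; the engineered baselines used in the experiments additionally sample absolute values so that decoding can start near any position, which this toy version omits, and the function names are ours.

```python
# A toy rendition of one classical baseline: gaps between consecutive
# integers compressed with standard Elias gamma codes. Sequential only;
# real random-access variants interleave sampled absolute values.

def gamma_encode(x):
    """Elias gamma code of x >= 1: (len-1) zeros, then x in binary."""
    b = bin(x)[2:]
    return "0" * (len(b) - 1) + b

def gamma_decode(bits, pos):
    """Decode one gamma code starting at pos; return (value, next_pos)."""
    n = 0
    while bits[pos + n] == "0":
        n += 1
    return int(bits[pos + n: pos + 2 * n + 1], 2), pos + 2 * n + 1

def compress_gaps(xs):
    """Sorted distinct non-negative integers -> concatenated gamma codes
    of the gaps (starting from -1 keeps every encoded gap >= 1)."""
    out, prev = [], -1
    for x in xs:
        out.append(gamma_encode(x - prev))
        prev = x
    return "".join(out)

def decompress_gaps(bits, n):
    """Invert compress_gaps: decode n codes and re-accumulate the gaps."""
    xs, pos, prev = [], 0, -1
    for _ in range(n):
        gap, pos = gamma_decode(bits, pos)
        prev += gap
        xs.append(prev)
    return xs
```

In this plain form, answering select(i) means decoding i + 1 codes from the beginning; storing every s-th absolute value alongside the bitstream caps that work at s decodes, which is the trade-off the compared implementations engineer carefully.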