In modern data warehousing, data skipping is essential for high query performance. While index structures such as B-trees or hash tables allow for precise pruning, their large storage requirements make them impractical for indexing secondary columns. Therefore, many systems rely on approximate indexes such as min/max sketches (ZoneMaps) or Bloom filters for cost-effective data pruning. For example, Google PowerDrill skips more than 90% of data on average using such indexes. In this paper, we introduce Cuckoo Index (CI), an approximate secondary index structure that represents the many-to-many relationship between keys and data partitions in a highly space-efficient way. At its core, CI associates variable-sized fingerprints in a Cuckoo filter with compressed bitmaps indicating qualifying partitions. With our approach, we target equality predicates in a read-only (immutable) setting and optimize for space efficiency under the premise of practical build and lookup performance. In contrast to per-partition (Bloom) filters, CI produces correct results for lookups with keys that occur in the data. CI allows to control the ratio of false positive partitions for lookups with non-occurring keys. Our experiments with real-world and synthetic data show that CI consumes significantly less space than per-partition filters for the same pruning power for low-to-medium cardinality columns. For high cardinality columns, CI is on par with its baselines.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.
customersupport@researchsolutions.com
10624 S. Eastern Ave., Ste. A-614
Henderson, NV 89052, USA
This site is protected by reCAPTCHA and the Google Privacy Policy and Terms of Service apply.
Copyright © 2024 scite LLC. All rights reserved.
Made with 💙 for researchers
Part of the Research Solutions Family.