Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data
DOI: 10.1145/3318464.3384706

CDFShop: Exploring and Optimizing Learned Index Structures

Abstract: Indexes are a critical component of data management applications. While tree-like structures (e.g., B-Trees) have been employed to great success, recent work suggests that index structures powered by machine learning models (learned index structures) can achieve low lookup times with a reduced memory footprint. This demonstration showcases CDFShop, a tool to explore and optimize recursive model indexes (RMIs), a type of learned index structure. This demonstration allows audience members to (1) gain an intuitio…
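The abstract's "recursive model index" can be illustrated with a minimal two-stage sketch: a root model routes each key to a second-stage linear model that predicts the key's position, and a per-model error bound limits the final binary search. This is an illustrative toy only, not CDFShop's or the RMI authors' implementation; the class name, fan-out, and least-squares fitting are assumptions.

```python
import bisect

class TwoStageRMI:
    """Toy 2-stage recursive model index over sorted keys.

    Both stages are linear fits to the empirical CDF; a real RMI
    (as tuned by CDFShop) may mix model types per stage.
    """

    def __init__(self, keys, fanout=4):
        self.keys = sorted(keys)
        self.fanout = fanout
        n = len(self.keys)
        lo, hi = self.keys[0], self.keys[-1]
        # Root: linear map from key to second-stage model index.
        self.root_scale = fanout / (hi - lo + 1)
        self.root_off = lo
        # Partition keys by leaf, then fit one linear model per leaf
        # predicting each key's position in the sorted array.
        buckets = [[] for _ in range(fanout)]
        for pos, k in enumerate(self.keys):
            buckets[self._leaf(k)].append((k, pos))
        self.leaves = []   # (slope, intercept) per leaf model
        self.errs = []     # max absolute prediction error per leaf
        for pts in buckets:
            if not pts:
                self.leaves.append((0.0, 0.0))
                self.errs.append(n)
                continue
            xs = [k for k, _ in pts]
            ys = [p for _, p in pts]
            slope, intercept = self._fit(xs, ys)
            err = max(abs(slope * x + intercept - y)
                      for x, y in zip(xs, ys))
            self.leaves.append((slope, intercept))
            self.errs.append(int(err) + 1)

    def _leaf(self, key):
        i = int((key - self.root_off) * self.root_scale)
        return min(max(i, 0), self.fanout - 1)

    @staticmethod
    def _fit(xs, ys):
        # Ordinary least squares for a single linear model.
        n = len(xs)
        mx, my = sum(xs) / n, sum(ys) / n
        denom = sum((x - mx) ** 2 for x in xs)
        slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / denom
                 if denom else 0.0)
        return slope, my - slope * mx

    def lookup(self, key):
        """Position of key in the sorted array, or -1 if absent."""
        leaf = self._leaf(key)
        slope, intercept = self.leaves[leaf]
        pred = int(slope * key + intercept)
        err = self.errs[leaf]
        lo = max(0, pred - err)
        hi = min(len(self.keys), pred + err + 1)
        i = bisect.bisect_left(self.keys, key, lo, hi)
        return i if i < len(self.keys) and self.keys[i] == key else -1

keys = [3, 7, 12, 20, 21, 35, 50, 64, 80, 99]
idx = TwoStageRMI(keys, fanout=4)
assert idx.lookup(35) == 5
assert idx.lookup(5) == -1
```

The error bound recorded per leaf is what keeps the "last-mile" search cheap: the binary search only scans a window of width proportional to the model's worst training error.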

Cited by 46 publications (36 citation statements)
References 4 publications
“…RS achieves the lowest build times, due to its single-pass build phase. Note that the build time of PLEX already includes the autotuning time, unlike RS, CHT, or RMI [11], which were tuned offline via an expensive grid search. Our current implementation of CHT does not support key duplicates, which is the case for the wiki dataset.…”
Section: Discussion
confidence: 99%
“…The implementations of RMI, RadixSpline and ALEX are obtained from their open-source repositories [3,45,46]. The RMI hyper-parameters are tuned using CDFShop [40], an automatic RMI optimizer. RadixSpline is manually tuned by varying the error tolerance of the underlying models.…”
Section: Methods
confidence: 99%
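The "expensive grid search" that the excerpts describe for offline RMI tuning can be sketched as follows. This is a hypothetical stand-in, not CDFShop's actual optimizer or API: it searches only a single fan-out parameter of a piecewise-linear CDF model and reports the worst-case prediction error, illustrating the size-versus-error trade-off such a search explores.

```python
import random

def max_error(keys, fanout):
    """Worst |predicted - true| array position for a piecewise-linear
    CDF model with `fanout` equal-width key segments (a stand-in for
    a tuned RMI's second stage; not CDFShop's cost model)."""
    keys = sorted(keys)
    lo, hi = keys[0], keys[-1]
    segments = [[] for _ in range(fanout)]
    for pos, key in enumerate(keys):
        i = min(int((key - lo) * fanout / (hi - lo + 1)), fanout - 1)
        segments[i].append((key, pos))
    worst = 0.0
    for pts in segments:
        if len(pts) < 2:
            continue
        # Line through the segment's first and last (key, position).
        (x0, y0), (x1, y1) = pts[0], pts[-1]
        slope = (y1 - y0) / (x1 - x0)
        for x, y in pts:
            worst = max(worst, abs(y0 + slope * (x - x0) - y))
    return worst

# "Grid search": a larger fan-out shrinks the error bound (cheaper
# last-mile search) but means more models, i.e. a bigger index.
random.seed(0)
keys = random.sample(range(1_000_000), 10_000)
results = {f: max_error(keys, f) for f in (4, 16, 64, 256)}
```

A real tuner such as CDFShop additionally varies model types and layer counts and reports a Pareto front of index size versus lookup time, rather than a single error number.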
“…Note that the assumption of 𝑛 ≫ 𝑚 is valid as, in practice, the RMI optimizer [40] typically models a large input relation with complex distribution using RMI of 2 to 3 levels (excluding the root), and a fan-out 𝑚 of 1000. In the case of 2 levels, for example, the 𝑛 𝑚 ratio becomes 1000, which is relatively large.…”
Section: Buffered GRMI INLJ
confidence: 99%
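The fan-out arithmetic in the excerpt can be checked directly; the concrete key count n below is an assumption chosen to match the quoted n/m ratio of 1000.

```python
# The excerpt's n >> m assumption, with its quoted numbers: a 2-level
# RMI (excluding the root) with fan-out m = 1000 over an assumed
# n = 1,000,000-key relation leaves n/m = 1000 keys per leaf model.
n, m = 1_000_000, 1_000
keys_per_leaf = n // m
assert keys_per_leaf == 1_000
```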
“…However, RMI does not guarantee an error bound for the keys that are not provided in the training phase. The original RMI work provides a solution that can be used in a 2-layer RMI when all models within the RMI are monotonic (Kraska et al, 2018; Marcus et al, 2020; Rashelbach et al, 2020). However, it does not generalize to a 3-layer RMI even if the models are monotonic.…”
Section: P-RMI: Partially-3-Layer RMI
confidence: 99%
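The monotonicity argument in the excerpt can be sketched with a toy model (an assumption-laden illustration, not the cited papers' construction): if the composed model is monotone in the key, then for any unseen query q between trained keys k_i and k_{i+1}, the prediction for q lies between the predictions for k_i and k_{i+1}. Hence the maximum error measured on trained keys, padded by one slot, also bounds the error for keys never seen in training, and a bounded last-mile binary search stays correct.

```python
import bisect
import random

random.seed(1)
keys = sorted(random.sample(range(100_000), 1_000))
n = len(keys)

def model(q):
    # A monotone (slope > 0) linear stand-in for a composed 2-layer RMI.
    return (q - keys[0]) * (n - 1) / (keys[-1] - keys[0])

# Error bound measured on the trained keys only.
eps = max(abs(model(k) - i) for i, k in enumerate(keys))

def lower_bound(q):
    # Monotonicity: for k_i <= q <= k_{i+1}, model(k_i) <= model(q)
    # <= model(k_{i+1}), so the trained-key bound eps, padded by one
    # slot, also brackets the true position of an unseen q.
    pred = model(q)
    hi = min(n, int(pred + eps) + 2)
    lo = max(0, min(int(pred - eps) - 1, hi))
    return bisect.bisect_left(keys, q, lo, hi)

# Holds for query keys never seen during "training".
for q in range(0, 100_000, 997):
    assert lower_bound(q) == bisect.bisect_left(keys, q)
```

The excerpt's point is that this padding argument composes cleanly across two monotone layers but breaks down for a 3-layer RMI, which is what motivates the partially-3-layer design.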