Theoretically-Efficient and Practical Parallel In-Place Radix Sorting

Obeya, Omar; Kahssay, Endrias; Fan, Edward M.; Shun, Julian

doi:10.1145/3323165.3323198

Cited by 19 publications

(25 citation statements)

References 41 publications

“…Note that, in order to use Radix Sort with IEEE-754 floating point numbers, it is first necessary to shift and mask the bit representation. While Radix Sort is highly sensitive to the key length, which dictates the number of passes, it is nevertheless a very efficient sorting algorithm for numerical types, that is very well-suited for multi-core procedures [6,22,40], and SIMD vectorization [50].…”

Section: Related Work (mentioning)

confidence: 99%
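The shift-and-mask step this statement alludes to can be sketched as follows. This is a minimal illustration, not code from either paper: the helper names (`float_to_sortable_key`, `sortable_key_to_float`, `radix_sort_doubles`) are hypothetical, NaNs are assumed absent, and the sort shown is a plain sequential, out-of-place LSD radix sort rather than the parallel in-place algorithm of the cited work.

```python
import struct

SIGN_BIT = 1 << 63
ALL_ONES = 0xFFFFFFFFFFFFFFFF

def float_to_sortable_key(x: float) -> int:
    """Map an IEEE-754 double to a 64-bit unsigned key whose integer
    order matches the numeric order of the doubles (NaNs excluded)."""
    (bits,) = struct.unpack("<Q", struct.pack("<d", x))
    if bits & SIGN_BIT:
        # Negative values: flip every bit so larger magnitudes sort first.
        return bits ^ ALL_ONES
    # Non-negative values: set the sign bit so they sort above negatives.
    return bits | SIGN_BIT

def sortable_key_to_float(key: int) -> float:
    """Invert float_to_sortable_key."""
    bits = key ^ SIGN_BIT if key & SIGN_BIT else key ^ ALL_ONES
    (x,) = struct.unpack("<d", struct.pack("<Q", bits))
    return x

def radix_sort_doubles(values):
    """Sequential LSD radix sort over 8-bit digits of the transformed keys."""
    keys = [float_to_sortable_key(v) for v in values]
    for shift in range(0, 64, 8):  # 8 passes: the key length fixes the pass count
        buckets = [[] for _ in range(256)]
        for k in keys:
            buckets[(k >> shift) & 0xFF].append(k)
        keys = [k for bucket in buckets for k in bucket]
    return [sortable_key_to_float(k) for k in keys]
```

The fixed eight passes for 64-bit keys illustrate why the statement calls Radix Sort sensitive to key length.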

“…As baselines, we compare against cache-optimized and highly tuned C++ implementations of Radix Sort [51], Timsort [18], Introsort (std::sort), Histogram Sort [4], and IS⁴o [49] (one of the most optimized sorting algorithms we were able to find, which was also recently used in other studies [40] as a comparison point). Note that we use a recursive, equidepth version of Histogram sort that adapts to the input's skew as to avoid severe performance penalties.…”

Section: Setup and Datasets (mentioning)

confidence: 99%

“…In fact, our learned sorting algorithm provides the best performance even when we include the model training time as a part of the overall sorting time. For example, our experiments show that Learned Sort yields an average of 3.38× performance improvement over C++ STL sort (std::sort) [16], 5.54× improvement over Timsort (Python's default sorting algorithm [45]), 1.49× over Radix sort [51], and 1.31× over IS⁴o [2], a cache-efficient version of the Samplesort and one of the fastest available sorting implementations [40].…”

Section: Introduction (mentioning)

confidence: 99%

The Case for a Learned Sorting Algorithm

Kristo, Vaidya, Çetintemel, et al. 2020

Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data

Sorting is one of the most fundamental algorithms in Computer Science and a common operation in databases not just for sorting query results but also as part of joins (i.e., sort-merge join) or indexing. In this work, we introduce a new type of distribution sort that leverages a learned model of the empirical CDF of the data. Our algorithm uses a model to efficiently get an approximation of the scaled empirical CDF for each record key and map it to the corresponding position in the output array. We then apply a deterministic sorting algorithm that works well on nearly-sorted arrays (e.g., Insertion Sort) to establish a totally sorted order. We compared this algorithm against common sorting approaches and measured its performance for up to 1 billion normally-distributed double-precision keys. The results show that our approach yields an average 3.38× performance improvement over C++ STL sort, which is an optimized Quicksort hybrid, 1.49× improvement over sequential Radix Sort, and 5.54× improvement over a C++ implementation of Timsort, which is the default sorting function for Java and Python.
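The two-phase idea in this abstract (predict each key's position from an approximate CDF, then repair the nearly-sorted result) can be sketched as below. This is only an illustration under stated assumptions: the function name `learned_sort_sketch` and its parameters are hypothetical, and the "model" is simply the empirical CDF of a small sorted sample, standing in for whatever learned model the authors actually train.

```python
import bisect
import random

def learned_sort_sketch(keys, sample_size=64, seed=0):
    """Distribution sort driven by an approximate CDF, finished by insertion sort."""
    n = len(keys)
    if n < 2:
        return list(keys)

    # "Training": a sorted random sample serves as a stand-in CDF model (assumption).
    rng = random.Random(seed)
    sample = sorted(rng.sample(keys, min(sample_size, n)))

    def cdf(x):
        # Fraction of sample points <= x, a monotone estimate in [0, 1].
        return bisect.bisect_right(sample, x) / len(sample)

    # Phase 1: scale the CDF estimate to an output slot; collisions share a slot.
    slots = [[] for _ in range(n)]
    for x in keys:
        pos = min(n - 1, int(cdf(x) * n))
        slots[pos].append(x)
    out = [x for slot in slots for x in slot]

    # Phase 2: insertion sort fixes the remaining local disorder.
    for i in range(1, n):
        x = out[i]
        j = i - 1
        while j >= 0 and out[j] > x:
            out[j + 1] = out[j]
            j -= 1
        out[j + 1] = x
    return out
```

Because the CDF estimate is monotone, keys can only be out of order within a slot, which is what keeps the insertion-sort pass cheap when the model fits the data well.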

“…These results demonstrate that Vortex-S achieves a substantial improvement over the previous methods, while incurring negligible RAM overhead 𝜖. On Skylake-X (i.e., 𝑐₃), it beats the fastest in-place methods [42], [48] by 3–4× and STL quicksort by 11×. While Vortex-S is hands-down the fastest technique that can sort 24 GB of keys on these machines, it is interesting to see how its performance stacks up against the best out-of-place methods.…”

Section: Sorting (mentioning)

confidence: 70%

Vortex: Extreme-Performance Memory Abstractions for Data-Intensive Streaming Applications

Hanel, Arman, Xiao, et al. 2020

Proceedings of the Twenty-Fifth International Conference on Architectural Support for Programming Languages and Operating Syste

Many applications in data analytics, information retrieval, and cluster computing process huge amounts of information. The complexity of involved algorithms and massive scale of data require a programming model that can not only offer a simple abstraction for inputs larger than RAM, but also squeeze maximum performance out of the available hardware. While these are usually conflicting goals, we show that this does not have to be the case for sequentially processed data, i.e., in streaming applications. We develop a set of algorithms called Vortex that force the application to generate access violations (i.e., page faults) during processing of the stream, which are transparently handled in such a way that creates an illusion of an infinite buffer that fits into a regular C/C++ pointer. This design makes Vortex by far the simplest-to-use and fastest platform for various types of streaming I/O, inter-thread data transfer, and key shuffling. We introduce several such applications: file I/O wrapper, bounded producer-consumer pipeline, vanishing array, key-partitioning engine, and novel in-place radix sort that is 3–4× faster than the best prior approaches. CCS Concepts • Software and its engineering → Virtual memory.

“…Consequently, pertinent top-𝑘 applications do not adopt priority queue-based top-𝑘. Instead, they use sort-and-choose approach for top-𝑘 computing on GPUs [6,18,33,44,48]. However, as shown in Figure 17, the GPU-based sort-and-choose top-𝑘 [6] takes much longer time than GPU-based top-k algorithms.…”

mentioning

confidence: 99%

Dr. Top-k: Delegate-Centric Top-k on GPUs

Gaihre, Zheng, Weitze, et al. 2021

Preprint

Recent top-𝑘 computation efforts explore the possibility of revising various sorting algorithms to answer top-𝑘 queries on GPUs. These endeavors, unfortunately, perform significantly more work than needed. This paper introduces Dr. Top-k, a Delegate-centric top-𝑘 system on GPUs that can reduce the top-𝑘 workloads significantly. Particularly, it contains three major contributions: First, we introduce a comprehensive design of the delegate-centric concept, including maximum delegate, delegate-based filtering, and 𝛽 delegate mechanisms to help reduce the workload for top-𝑘 up to more than 99%. Second, due to the difficulty and importance of deriving a proper subrange size, we perform a rigorous theoretical analysis, coupled with thorough experimental validations to identify the desirable subrange size. Third, we introduce four key system optimizations to enable fast multi-GPU top-𝑘 computation. Taken together, this work constantly outperforms the state-of-the-art.
