A comprehensive study of main-memory partitioning and its application to large-scale comparison- and radix-sort

Polychroniou, Orestis; Ross, Kenneth A.

doi:10.1145/2588555.2610522

Cited by 83 publications

(83 citation statements)

References 15 publications


“…To efficiently exploit SIMD instructions and the multiple cores of today's processors, multiway mergesort has gained popularity as a high-performance in-memory sorting algorithm for sorting 32-bit or 64-bit integer values in database systems [7][8][9] or in distributed sorting systems running on large-scale supercomputers [5] or clusters [6]. Because many widely used sorting algorithms, such as quicksort, are not suitable for exploiting the SIMD instructions, multiway mergesort outperforms them by exploiting the SIMD instructions.…”

Section: Related Work (mentioning)

confidence: 99%

“…Recently, multiway mergesort implemented with SIMD instructions has been used as a high performance in-memory sorting algorithm for sorting 32-bit or 64-bit integer values [1][2][3][4][5][6][7][8][9]. By using the SIMD instructions efficiently in the merge operation, multiway mergesort outperforms other comparison-based sorting algorithms, such as quicksort, that are not suitable for exploiting the SIMD instructions.…”

Section: Introduction (mentioning)

confidence: 99%


SIMD- and cache-friendly algorithm for sorting an array of structures

Inoue

Taura

2015

Proc. VLDB Endow.


This paper describes our new algorithm for sorting an array of structures by efficiently exploiting the SIMD instructions and cache memory of today's processors. Recently, multiway mergesort implemented with SIMD instructions has been used as a high-performance in-memory sorting algorithm for sorting integer values. For sorting an array of structures with SIMD instructions, a frequently used approach is to first pack the key and index for each record into an integer value, sort the key-index pairs using SIMD instructions, then rearrange the records based on the sorted key-index pairs. This approach can efficiently exploit SIMD instructions because it sorts the key-index pairs while packed into integer values; hence, it can use existing high-performance sorting implementations of the SIMD-based multiway mergesort for integers. However, this approach has frequent cache misses in the final rearranging phase due to its random and scattered memory accesses, so this phase limits both single-thread performance and scalability with multiple cores. Our approach is also based on multiway mergesort, but it can avoid costly random accesses for rearranging the records while still efficiently exploiting the SIMD instructions. Our results showed that our approach exhibited up to 2.1x better single-thread performance than the key-index approach implemented with SIMD instructions when sorting 512M 16-byte records on one core. Our approach also yielded better performance when we used multiple cores. Compared to an optimized radix sort, our vectorized multiway mergesort achieved better performance when each record is large. Our vectorized multiway mergesort also yielded higher scalability with multiple cores than the radix sort.
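The key-index approach that this abstract describes as the baseline can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `sort_records_key_index`, the 32-bit index width, and the use of Python's built-in sort in place of a SIMD multiway mergesort are all assumptions made for illustration.

```python
def sort_records_key_index(records, key_of):
    """Sort records by packing (key, index) into one integer per record.

    Assumptions (hypothetical, for illustration only): keys are
    non-negative integers and there are fewer than 2**32 records.
    """
    index_bits = 32
    mask = (1 << index_bits) - 1

    # Step 1: pack the key into the high bits and the record's
    # original position into the low bits.
    packed = [(key_of(r) << index_bits) | i for i, r in enumerate(records)]

    # Step 2: sort the packed integers. Python's built-in sort stands in
    # here for the SIMD-based multiway mergesort named in the abstract.
    packed.sort()

    # Step 3: rearrange the records by the recovered index. Each lookup
    # is a random access into `records`, which is the cache-unfriendly
    # phase the abstract identifies as the bottleneck.
    return [records[p & mask] for p in packed]


records = [(30, "c"), (10, "a"), (20, "b"), (10, "d")]
result = sort_records_key_index(records, key_of=lambda r: r[0])
```

Because the original position occupies the low bits, records with equal keys keep their input order in this sketch.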


“…Zhang and Ré focus on statistical analytics and conclude that awareness of the core topology can improve performance by an order of magnitude compared to the state-of-the-art systems [72]. On the other hand, the majority of the proposals that target building NUMA-aware data management systems focus on removing memory bandwidth bottlenecks for analytical applications and specifically devising efficient join and sorting algorithms that minimize data movement [3,6,42,51]. However, OLTP workloads cannot saturate memory bandwidths and their main problem is ensuring efficient synchronization among threads [52].…”

Section: Performance On Multisocket Multicores (mentioning)

confidence: 99%

Characterization of the Impact of Hardware Islands on OLTP

Porobic

Pandis

Branco

et al. 2015

The VLDB Journal


Modern hardware is abundantly parallel and increasingly heterogeneous. The numerous processing cores have non-uniform access latencies to the main memory and processor caches, which causes variability in the communication costs. Unfortunately, database systems mostly assume that all processing cores are the same and that microarchitecture differences are not significant enough to appear in critical database execution paths. As we demonstrate in this paper, however, non-uniform core topology does appear in the critical path and conventional database architectures achieve suboptimal and even worse, unpredictable performance. We perform a detailed performance analysis of OLTP deployments in servers with multiple cores per CPU (multicore) and multiple CPUs per server (multisocket). We compare different database deployment strategies where we vary the number and size of independent database instances running on a single server, from a single shared-everything instance to fine-grained shared-nothing configurations. We quantify the impact of non-uniform hardware on various deployments by (a) examining how efficiently each deployment uses the available hardware resources and (b) measuring the impact of distributed transactions and skewed requests on different workloads. We show that no strategy is optimal for all cases and that the best choice depends on the combination of hardware topology and workload characteristics. Finally, we argue that transaction processing systems must be aware of the hardware topology in order to achieve predictably high performance.


“…For example, if we want an implementation that does not require linear auxiliary space, we need to use "in-place" partitioning, which affects the algorithm and its performance significantly. Recent work provides a detailed explanation and exploration of many such variants [Polychroniou and Ross 2014]. We break our analysis of software partitioning down by phase, first examining data shuffling policies in isolation, then later including the computation of the partition function.…”

Section: Partitioning Background (mentioning)

confidence: 99%

“…In this article, we … This manuscript contains content previously published in ISCA '13 [Wu et al. 2013]. This extended article substitutes a state-of-the-art software partitioner [Polychroniou and Ross 2014] for the microbenchmark used in the original paper and includes the new, extensive exploration of software partitioning performance and energy found in Section 3. The research was supported by grants from the National Science Foundation (CCF-1065338 and IIS-0915956) and a gift from Oracle Corporation.…”

Section: Introduction (mentioning)

confidence: 99%

Energy Analysis of Hardware and Software Range Partitioning

Polychroniou

Barker

et al. 2014

ACM Trans. Comput. Syst.

Self Cite


Data partitioning is a critical operation for manipulating large datasets because it subdivides tasks into pieces that are more amenable to efficient processing. It is often the limiting factor in database performance and represents a significant fraction of the overall runtime of large data queries. This article measures the performance and energy of state-of-the-art software partitioners, and describes and evaluates a hardware range partitioner that further improves efficiency. The software implementation is broken into two phases, allowing separate analysis of the partition function computation and data shuffling costs. Although range partitioning is commonly thought to be more expensive than simpler strategies such as hash partitioning, our measurements indicate that careful data movement and optimization of the partition function can allow it to approach the throughput and energy consumption of hash or radix partitioning. For further acceleration, we describe a hardware range partitioner, or HARP, a streaming framework that offers a seamless execution environment for this and other streaming accelerators, and a detailed analysis of a 32nm physical design that matches the throughput of four to eight software threads while consuming just 6.9% of the area and 4.3% of the power of a Xeon core in the same technology generation.
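The two-phase split mentioned in this abstract (computing the partition function, then shuffling the data) can be sketched as follows. This is an illustrative assumption, not the partitioner evaluated in the article: the function name `range_partition`, the binary search over sorted delimiters, and the out-of-place histogram-based shuffle are all choices made for this example.

```python
from bisect import bisect_right


def range_partition(keys, delimiters):
    """Out-of-place range partitioning in two phases.

    `delimiters` is assumed to be sorted; partition p holds the keys k
    with delimiters[p-1] <= k < delimiters[p].
    """
    # Phase 1: partition function. A binary search over the delimiters
    # maps each key to the id of its range.
    part_ids = [bisect_right(delimiters, k) for k in keys]

    # Phase 2: data shuffling. Count keys per partition, turn the counts
    # into starting offsets with a prefix sum, then scatter the keys.
    counts = [0] * (len(delimiters) + 1)
    for p in part_ids:
        counts[p] += 1

    starts = [0] * len(counts)
    for p in range(1, len(counts)):
        starts[p] = starts[p - 1] + counts[p - 1]

    cursor = list(starts)  # next free slot in each partition
    out = [None] * len(keys)
    for k, p in zip(keys, part_ids):
        out[cursor[p]] = k
        cursor[p] += 1

    return out, starts


partitioned, starts = range_partition([5, 42, 17, 8, 99, 23], [10, 30])
```

Keeping the two phases separate, as in this sketch, makes it possible to time the partition function and the shuffle independently.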

