Fast parallel GPU-sorting using a hybrid algorithm

Sintorn, Erik; Assarsson, Ulf

doi:10.1016/j.jpdc.2008.05.012

Cited by 128 publications

(91 citation statements)

References 8 publications

“…Furthermore, GPUABiSort [9] was proposed, that is based on adaptive bitonic sort [2] and rearranges the data using bitonic trees to reduce the number of comparisons. Recently added GPU capabilities like scattered writes, flexible comparisons and atomic operations on memory have enabled methods combining radixsort and mergesort to achieve faster performances on modern GPUs [19,21,22].…”

Section: Related Work (mentioning)

confidence: 99%

“…Table 2 lists the best reported execution times 4 (in seconds) for varying dataset sizes on various architectures (IBM Cell [7], Nvidia 8600 GTS [22], Nvidia 8800 GTX and Quadro FX 5600 [19], Intel 2-core Xeon with Quicksort [7], and IBM PowerPC 970MP [11]). Our performance numbers are faster than those reported on other architectures.…”

Section: Comparison With Analytical Model (mentioning)

confidence: 99%

Efficient implementation of sorting on multi-core SIMD CPU architecture

et al. 2008

Sorting a list of input numbers is one of the most fundamental problems in the field of computer science in general and high-throughput database applications in particular. Although literature abounds with various flavors of sorting algorithms, different architectures call for customized implementations to achieve faster sorting times. This paper presents an efficient implementation and detailed analysis of MergeSort on current CPU architectures. Our SIMD implementation with 128-bit SSE is 3.3X faster than the scalar version. In addition, our algorithm performs an efficient multiway merge, and is not constrained by the memory bandwidth. Our multi-threaded, SIMD implementation sorts 64 million floating point numbers in less than 0.5 seconds on a commodity 4-core Intel processor. This measured performance compares favorably with all previously published results. Additionally, the paper demonstrates performance scalability of the proposed sorting algorithm with respect to certain salient architectural features of modern chip multiprocessor (CMP) architectures, including SIMD width and core-count. Based on our analytical models of various architectural configurations, we see excellent scalability of our implementation with SIMD width scaling up to 16X wider than current SSE width of 128-bits, and CMP core-count scaling well beyond 32 cores. Cycle-accurate simulation of Intel's upcoming x86 many-core Larrabee architecture confirms scalability of our proposed algorithm.

“…Global Radix Uses radix sort on the entire sequence [1]. Hybridsort Uses a bucket sort followed by a merge sort [15]. STL-Introsort This is the Introsort implementation found in the C++ Standard Library.…”

Section: Experimental Evaluation (mentioning)

confidence: 99%

“…Sengupta et al [1] have presented a radix-sort and a Quicksort implementation. Recently, Sintorn et al [15] presented a sorting algorithm that combines bucket sort with merge sort.…”

Section: Introduction (mentioning)

confidence: 99%

A Practical Quicksort Algorithm for Graphics Processors

Cederman, Tsigas 2008

Algorithms - ESA 2008

121

Abstract. In this paper we present GPU-Quicksort, an efficient Quicksort algorithm suitable for highly parallel multi-core graphics processors. Quicksort has previously been considered as an inefficient sorting solution for graphics processors, but we show that GPU-Quicksort often performs better than the fastest known sorting implementations for graphics processors, such as radix and bitonic sort. Quicksort can thus be seen as a viable alternative for sorting large quantities of data on graphics processors.

“…Purcell et al [10] have presented an implementation of bitonic merge sort on GPU based on an implementation by Kapasi et al [11]. Sintorn et al [12] presented a hybrid sorting algorithm which splits the data with a bucket sort and then uses merge sort on the resulting blocks.…”

Section: Related Work (mentioning)

confidence: 99%

GPU Matrix Sort (An Efficient Implementation of Merge Sort)

Panwar, Kumar, Bhargava 2014

IJCA

Sorting is one of the frequently used operations in computer science, and a lot of work has already been done to improve the efficiency of sorting algorithms. Because GPU architecture is highly parallel, it can be utilized for sorting. We treat the input array to be sorted as a 2D matrix and apply a modified version of merge sort to that matrix. This modification leads to a much more efficient sorting algorithm with reduced complexity. In this paper we use the GPU architecture to solve the sorting problem.
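
Several of the citation statements above describe the hybrid algorithm as a bucket sort that splits the data, followed by a merge sort on each resulting block. The sketch below illustrates that two-phase idea only. It is a sequential CPU version, not the authors' GPU implementation; the names `hybrid_sort`, `merge_sort`, and `num_buckets` are hypothetical, and the equal-width bucket boundaries are an assumption made for illustration rather than the paper's pivot-selection scheme.

```python
from heapq import merge


def merge_sort(values):
    """Plain top-down merge sort; returns a new sorted list."""
    if len(values) <= 1:
        return list(values)
    mid = len(values) // 2
    # heapq.merge lazily merges two already-sorted inputs.
    return list(merge(merge_sort(values[:mid]), merge_sort(values[mid:])))


def hybrid_sort(values, num_buckets=4):
    """Bucket sort into disjoint value ranges, then merge sort each bucket.

    Assumption: buckets are equal-width slices of [min, max]. This is an
    illustrative choice, not the splitting rule used in the paper.
    """
    if not values:
        return []
    lo, hi = min(values), max(values)
    if lo == hi:
        return list(values)  # all elements equal, nothing to sort

    width = (hi - lo) / num_buckets
    buckets = [[] for _ in range(num_buckets)]
    for v in values:
        # Clamp so that v == hi lands in the last bucket.
        idx = min(int((v - lo) / width), num_buckets - 1)
        buckets[idx].append(v)

    # Buckets cover increasing, non-overlapping ranges, so sorting each
    # one independently and concatenating yields a fully sorted result.
    result = []
    for bucket in buckets:
        result.extend(merge_sort(bucket))
    return result
```

Because each bucket covers its own value range, the per-bucket sorts are independent of one another, which is the property that makes this kind of split attractive for parallel hardware.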
