Efficient Parallel Sort on AVX-512-Based Multi-Core and Many-Core Architectures

Yin, Zekun; Zhang, Tianyu; Müller, André C.; Liu, Hui; Wei, Yanjie; Schmidt, Bertil; Liu, Weiguo

doi:10.1109/hpcc/smartcity/dss.2019.00038

Cited by 12 publications

(9 citation statements)

References 16 publications


“…Usually, after sorting the values column-wise the matrix is transposed, so that the sorted column vectors become row vectors [4,5]. We avoid this transposition and start merging with vectorized Bitonic Merge networks on the sorted columns themselves.…”

Section: Sorting Network (mentioning)

confidence: 99%
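The column-wise sorting step mentioned in the statement above can be illustrated with a minimal scalar sketch. It assumes each "vector" is a plain Python list standing in for a SIMD register, and it uses the standard five-comparator sorting network for four inputs. The names `coex`, `NETWORK_4` and `sort_columns` are hypothetical and are not taken from the cited papers.

```python
def coex(a, b):
    """Lane-wise compare-and-exchange: minima go to the first vector, maxima to the second."""
    lo = [min(x, y) for x, y in zip(a, b)]
    hi = [max(x, y) for x, y in zip(a, b)]
    return lo, hi


# Standard 5-comparator sorting network for 4 inputs (0-based row indices).
NETWORK_4 = [(0, 1), (2, 3), (0, 2), (1, 3), (1, 2)]


def sort_columns(rows):
    """Sort every column of a 4 x w matrix by applying the network to whole rows."""
    rows = [list(r) for r in rows]  # copy so the caller's data is untouched
    for i, j in NETWORK_4:
        rows[i], rows[j] = coex(rows[i], rows[j])
    return rows


matrix = [[7, 2], [1, 8], [5, 4], [3, 6]]
print(sort_columns(matrix))
```

Because each comparator acts on all lanes at once, every column is sorted independently. The result is sorted along columns rather than rows, which is why a transposition or a column-aware merge is needed afterwards.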

“…To execute the same modules using vectorized compare-and-exchange operations (coex) in a 4 × 2 matrix, we first swap the adjacent elements of the last two vectors (see Figure 2b left). In the second merge step the compare-and-exchange modules (1,3), (2,4), (5,7) and (6,8) are executed. In the vectorized version (Figure 2b center), we swap the adjacent elements of the second and fourth vectors before executing the two vectorized compare-and-exchange operations.…”

Section: Sorting Network (mentioning)

confidence: 99%
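The compare-and-exchange modules named in the statement above match the second pass of a textbook bitonic merge over eight elements. The sketch below is a scalar illustration only: it does not reproduce the 4 × 2 register layout or the element swaps shown in the cited figure, and the function names are hypothetical. It assumes two sorted inputs of equal length whose combined length is a power of two.

```python
def merge_step_pairs(n, stride):
    """Return the 1-based compare-and-exchange pairs of one bitonic merge pass."""
    return [(i + 1, i + stride + 1) for i in range(n) if i & stride == 0]


def bitonic_merge(seq):
    """Sort a bitonic sequence whose length is a power of two."""
    data = list(seq)
    n = len(data)
    stride = n // 2
    while stride >= 1:
        for lo, hi in merge_step_pairs(n, stride):
            i, j = lo - 1, hi - 1  # back to 0-based indices
            if data[i] > data[j]:
                data[i], data[j] = data[j], data[i]
        stride //= 2
    return data


def merge_sorted(a, b):
    """Merge two sorted lists by reversing the second to form a bitonic sequence."""
    return bitonic_merge(list(a) + list(reversed(b)))


print(merge_step_pairs(8, 2))
print(merge_sorted([1, 4, 6, 7], [2, 3, 5, 8]))
```

In a vectorized implementation, the pairs of one pass are grouped so that a single lane-wise minimum and maximum handles several modules at once. The element swaps described in the statement serve to line the partners up in matching lanes.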


Vectorized and performance-portable Quicksort

Blacher, Giesen, Sanders et al. 2022 (preprint)


Recent works showed that implementations of Quicksort using vector CPU instructions can outperform the non-vectorized algorithms in widespread use. However, these implementations are typically single-threaded, implemented for a particular instruction set, and restricted to a small set of key types. We lift these three restrictions: our proposed vqsort algorithm integrates into the state-of-the-art parallel sorter ips4o, with a geometric mean speedup of 1.59. The same implementation works on seven instruction sets (including SVE and RISC-V V) across four platforms. It also supports floating-point and 16-128 bit integer keys. To the best of our knowledge, this is the fastest sort for non-tuple keys on CPUs, up to 20 times as fast as the sorting algorithms implemented in standard libraries. This paper focuses on the practical engineering aspects enabling the speed and portability, which we have not yet seen demonstrated for a Quicksort implementation. Furthermore, we introduce compact and transpose-free sorting networks for in-register sorting of small arrays, and a vector-friendly pivot sampling strategy that is robust against adversarial input.



“…Yin et al [26] described an efficient parallel sort on AVX-512-based multi-core and many-core architectures. Their approach achieves to sort 1.1 billion floats per second on an Intel KNL (AVX-512).…”

Section: Related Work On Vectorized Sorting Algorithms (mentioning)

confidence: 99%

Peer Review #1 of "A fast vectorized sorting implementation based on the ARM scalable vector extension (SVE) (v0.1)"

2021


“…Yin et al [29] described an efficient parallel sort on AVX-512-based multicore and many-core architectures. Their approach achieves to sort 1.1 billion floats per second on an Intel KNL (AVX-512).…”

Section: Related Work On Vectorized Sorting Algorithms (mentioning)

confidence: 99%