2017
DOI: 10.1145/3108139

GPU Multisplit

Abstract: Multisplit is a broadly useful parallel primitive that permutes its input data into contiguous buckets or bins, where the function that categorizes an element into a bucket is provided by the programmer. Due to the lack of an efficient multisplit on GPUs, programmers often choose to implement multisplit with a sort. One way is to first generate an auxiliary array of bucket IDs and then sort input data based on it. In case smaller indexed buckets possess smaller valued keys, another way for multisplit is to directl…
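As a concrete illustration of the primitive (a minimal sequential sketch, not the paper's GPU implementation), multisplit can be expressed as a histogram over bucket IDs, an exclusive prefix sum to find each bucket's start offset, and a stable scatter. Function and parameter names here are illustrative:

```python
from itertools import accumulate

def multisplit(data, bucket_of, num_buckets):
    """Permute `data` into contiguous buckets, preserving the input
    order within each bucket (a stable multisplit)."""
    # 1) compute each element's bucket ID and a histogram of bucket sizes
    ids = [bucket_of(x) for x in data]
    counts = [0] * num_buckets
    for b in ids:
        counts[b] += 1
    # 2) exclusive prefix sum gives each bucket's starting offset
    offsets = [0] + list(accumulate(counts))[:-1]
    # 3) stable scatter: place each element at its bucket's next free slot
    out = [None] * len(data)
    for x, b in zip(data, ids):
        out[offsets[b]] = x
        offsets[b] += 1
    return out

# Example: a two-bucket split into evens (bucket 0) and odds (bucket 1).
result = multisplit([5, 2, 7, 4, 1, 8], bucket_of=lambda x: x % 2, num_buckets=2)
# → [2, 4, 8, 5, 7, 1]  (evens in input order, then odds in input order)
```

The sort-based workaround mentioned in the abstract achieves the same permutation by stably sorting elements on their bucket IDs, but pays the full cost of a sort for what is fundamentally a counting problem.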


Cited by 17 publications (11 citation statements)
References 24 publications
“…The CUDA Data Parallel Primitives Library (CUDPP) [1] is a library of fundamental DPPs and algorithms written in Nvidia CUDA C [85] and designed for high-performance execution on CUDA-compatible GPUs. Each DPP and algorithm incorporated into the library is considered best-in-class and typically published in peer-reviewed literature (e.g., radix sort [6], [76], mergesort [27], [97], and cuckoo hashing [3], [4]). Thus, its data-parallel implementations are constantly updated to reflect the state-of-the-art.…”
Section: Data Parallel Primitives
confidence: 99%
“…The distributed mode assigns each key (and its associated values) to exactly one distinct GPU. This is done by first partitioning keys of an input batch according to their corresponding GPU ID by means of a device-sided multi-split [16] followed by scattering these segments to the GPUs where they belong. In case each participating GPU holds a separate input batch, we use an all-to-all communication primitive on NVLink connected systems [17] to simultaneously exchange segments between all GPUs.…”
Section: E. Multi-GPU Support
confidence: 99%
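The key-to-GPU partitioning step described in the excerpt above is itself a multisplit with one bucket per GPU. A minimal sequential sketch (the `owner` mapping is a hypothetical stand-in for whatever key-to-GPU assignment a system uses, not the cited implementation):

```python
def partition_by_gpu(keys, num_gpus, owner):
    """Multisplit keys into per-GPU segments: segment b holds the keys
    owned by GPU b, in their original (stable) order, ready to be
    scattered to their destination devices."""
    segments = [[] for _ in range(num_gpus)]
    for k in keys:
        segments[owner(k)].append(k)
    return segments

# hypothetical owner function: distribute keys by modulo over 4 GPUs
segs = partition_by_gpu([10, 3, 7, 12, 5], num_gpus=4, owner=lambda k: k % 4)
# → [[12], [5], [10], [3, 7]]
```

On a real multi-GPU system the resulting contiguous segments are what gets exchanged, e.g. via an all-to-all communication step, so that each GPU ends up holding exactly the keys it owns.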
“…Our bulk strategy for cleanup is to 1) iteratively merge all occupied levels from the smallest to the largest (neglecting the LSB to preserve time ordering), 2) mark all unmarked stale elements (e.g., overwriting the LSBs), 3) compact all valid elements together (e.g., using a two-bucket multisplit [20] to collect all unmarked valid elements in stage 2), 4) add enough placebos, and 5) redistribute (already sorted) elements to different (new) levels. Since all levels are already sorted, merging them together iteratively is much faster than resorting all of them together.…”
Section: E. Cleanup
confidence: 99%
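The bulk cleanup steps quoted above can be sketched sequentially (a simplification for illustration, not the cited GPU LSM implementation): merge the already-sorted levels one by one, then perform a two-bucket multisplit on validity so that all valid elements are compacted together while sorted order is preserved.

```python
from heapq import merge

def cleanup(levels, is_valid):
    # 1) iteratively merge the already-sorted levels; merging sorted runs
    #    is much cheaper than re-sorting all elements from scratch
    merged = []
    for level in levels:
        merged = list(merge(merged, level))
    # 2-3) two-bucket multisplit on validity: valid elements end up
    #      contiguous and stale ones separated, both still in sorted order
    valid = [x for x in merged if is_valid(x)]
    stale = [x for x in merged if not is_valid(x)]
    return valid, stale

# example: two sorted levels; elements 4 and 9 are marked stale
valid, stale = cleanup([[2, 9], [1, 4, 7, 8]],
                       is_valid=lambda x: x not in {4, 9})
# → valid = [1, 2, 7, 8], stale = [4, 9]
```

The compacted valid run can then be padded and redistributed across new levels, as the excerpt's steps 4 and 5 describe.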
“…The more elements to be removed the better. For example, with n = (2^6 − 1)b elements where b = 2^20, cleanup operations when {10, 50}% of elements should be removed runs at {1870.2, 1828.2} M elements/s. A GPU LSM with roughly the same size (n = (2^7 − 1)b with b = 2^19) with {10, 50}% of elements removed results in {1842.5, 1794.3} M elements/s.…”
Section: Cleanup and Its Relation With Queries
confidence: 99%