Performance Evaluation of Priority Queues for Fine-Grained Parallel Tasks on GPUs

Baudis, Nikolai; Jacob, Florian; Andelfinger, Philipp

doi:10.1109/mascots.2017.15

Cited by 4 publications

(4 citation statements)

References 43 publications


“…Baudis et al [13] evaluate the performance of PQs on a GPU implemented as a single parallel heap or as a set of ring buffers, implicit binary heaps, and splay trees [146] in the context of DES and path finding on grids. Their results indicate that for up to about 500 elements per PQ, ring buffers achieve the highest performance.…”

Section: Representation Of Irregular Data Structures By Arrays and Grids (mentioning)

confidence: 99%

A Survey on Agent-based Simulation Using Hardware Accelerators

et al. 2019

Self Cite


Due to decelerating gains in single-core CPU performance, computationally expensive simulations are increasingly executed on highly parallel hardware platforms. Agent-based simulations, where simulated entities act with a certain degree of autonomy, frequently provide ample opportunities for parallelisation. Thus, a vast variety of approaches proposed in the literature demonstrated considerable performance gains using hardware platforms such as many-core CPUs and GPUs, merged CPU-GPU chips as well as FPGAs. Typically, a combination of techniques is required to achieve high performance for a given simulation model, putting substantial burden on modellers. To the best of our knowledge, no systematic overview of techniques for agent-based simulations on hardware accelerators has been given in the literature. To close this gap, we provide an overview and categorization of the literature according to the applied techniques. Since at the current state of research, challenges such as the partitioning of a model for execution on heterogeneous hardware are still a largely manual process, we sketch directions for future research towards automating the hardware mapping and execution. This survey targets modellers seeking an overview of suitable hardware platforms and execution techniques for a specific simulation model, as well as methodology researchers interested in potential research gaps requiring further exploration.
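The citation statement above reports that ring buffers performed best for priority queues of up to about 500 elements. As a rough illustration of that kind of structure, the following is a minimal CPU-side sketch of a sorted ring-buffer priority queue. It is not the implementation evaluated by Baudis et al.; the class name `RingBufferPQ`, its methods, and the choice to keep the buffer sorted on insertion are assumptions made purely for illustration.

```python
class RingBufferPQ:
    """Hypothetical fixed-capacity min-priority queue kept sorted in a circular array."""

    def __init__(self, capacity):
        self.buf = [None] * capacity   # storage reused circularly
        self.capacity = capacity
        self.head = 0                  # index of the smallest element
        self.size = 0

    def push(self, key):
        if self.size == self.capacity:
            raise OverflowError("ring buffer is full")
        # Start at the free slot just past the tail and shift larger keys
        # one slot towards the tail until the insertion point is found.
        i = self.size
        while i > 0:
            prev = self.buf[(self.head + i - 1) % self.capacity]
            if prev <= key:
                break
            self.buf[(self.head + i) % self.capacity] = prev
            i -= 1
        self.buf[(self.head + i) % self.capacity] = key
        self.size += 1

    def pop_min(self):
        if self.size == 0:
            raise IndexError("pop from empty queue")
        key = self.buf[self.head]
        self.buf[self.head] = None
        # Removing the minimum only advances the head index.
        self.head = (self.head + 1) % self.capacity
        self.size -= 1
        return key
```

In this sketch, insertion costs time linear in the queue length while removing the minimum is constant time, and all elements sit in one contiguous array. For queues of a few hundred elements the linear shift is short, which is one plausible reading of why such a simple structure can compete with tree-based queues at small sizes.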


“…While not inherently parallel, in the context of the EM or cache-oblivious models, the cache-oblivious bucket heap [4] and buffer heap [12] structures achieve sub-constant time operations when the block size, B, is sufficiently large. Since there are no parallel, cache-efficient priority queue structures, few works have considered using priority queues on GPUs [13,14]. While, in 2012, He et al [14] presented a priority queue that could achieve a 30x speedup over sequential execution, Baudis et al [13] more recently demonstrated that, for small queues of up to 500 items, simple circular buffers out-perform tree-based queues for a range of applications.…”

Section: Background and Related Work (mentioning)

confidence: 99%

“…Since there are no parallel, cache-efficient priority queue structures, few works have considered using priority queues on GPUs [13,14]. While, in 2012, He et al [14] presented a priority queue that could achieve a 30x speedup over sequential execution, Baudis et al [13] more recently demonstrated that, for small queues of up to 500 items, simple circular buffers out-perform tree-based queues for a range of applications.…”

Section: Background and Related Work (mentioning)

confidence: 99%

A parallel priority queue with fast updates for GPU architectures

Berney, Iacono, Karsin et al. 2019

Preprint


The high computational throughput of modern graphics processing units (GPUs) makes them the de-facto architecture for high-performance computing applications. However, to achieve peak performance, GPUs require highly parallel workloads, as well as memory access patterns that exhibit good locality of reference. As a result, many state-of-the-art algorithms and data structures designed for GPUs sacrifice work-optimality to achieve the necessary parallelism. Furthermore, some abstract data types are avoided completely due to there being no corresponding data structure that performs well on the GPU. One such abstract data type is the priority queue. Many well-known algorithms rely on priority queue operations as a building block. While various priority queue structures have been developed that are parallel, cache-aware, or cache-oblivious, none has been shown to be efficient on GPUs. In this paper, we present the parBucketHeap, a parallel, cache-efficient data structure designed for modern GPU architectures that supports standard priority queue operations, as well as bulk update. We analyze the structure in several well-known computational models and show that it provides both optimal parallelism and is cache-efficient. We implement the parBucketHeap and, using it, we solve the single-source shortest path (SSSP) problem. Experimental results indicate that, for sufficiently large, dense graphs with high diameter, we out-perform current state-of-the-art SSSP algorithms on the GPU by up to a factor of 5. Unlike existing GPU SSSP algorithms, our approach is work-optimal and places significantly less load on the GPU, reducing power consumption.


“…The implementation based on ring buffers and the synchronisation based on atomic operations closely resembles GPU-based discrete-event simulations, which have been shown to achieve high speedup over a CPU-based execution [35], [36]. Our approach to conflict resolution postpones the conflict resolution to after the Act stage and iterates until all conflicts have been resolved based on the relative position of agents.…”

Section: Full Offloading (mentioning)

confidence: 99%

Exploring Execution Schemes for Agent-Based Traffic Simulation on Heterogeneous Hardware

Xiao, Andelfinger, Eckhoff et al. 2018

2018 IEEE/ACM 22nd International Symposium on Distributed Simulation and Real Time Applications (DS-RT)

Self Cite


Microscopic traffic simulation is associated with substantial runtimes, limiting the feasibility of large-scale evaluation of traffic scenarios. Even though today heterogeneous hardware comprised of CPUs, graphics processing units (GPUs) and fused CPU-GPU devices is inexpensive and widely available, common traffic simulators still rely purely on CPU-based execution, leaving substantial acceleration potentials untapped. A number of existing works have considered the execution of traffic simulations on accelerators, but have relied on simplified models of road networks and driver behaviour tailored to the given hardware platform. Thus, the existing approaches cannot directly benefit from the vast body of research on the validity of common traffic simulation models. In this paper, we explore the performance gains achievable through the use of heterogeneous hardware when relying on typical traffic simulation models used in CPU-based simulators. We propose a partial offloading approach that relies either on a dedicated GPU or a fused CPU-GPU device. Further, we present a traffic simulation running fully on a manycore GPU and discuss the challenges of this approach. Our results show that a CPU-based parallelisation closely approaches the results of partial offloading, while full offloading substantially outperforms the other approaches. We achieve a speedup of up to 28.7x over the sequential execution on a CPU.
