Fine-Grained Synchronizations and Dataflow Programming on GPUs

Li, Ang; Braak, Gert-Jan van den; Corporaal, Henk; Kumar, Akash

doi:10.1145/2751205.2751232

Cited by 41 publications

(16 citation statements)

References 21 publications

“…Li, et al [24] propose a lightweight scratchpad memory lock design in software for older Nvidia GPUs (Fermi and Kepler) that uses software atomics for scratchpad memories. Their solution improves local (i.e.…”

Section: GPU Solutions (mentioning)

confidence: 99%

Fast Fine-Grained Global Synchronization on GPUs

Wang

Fussell

Lin

2019

Proceedings of the Twenty-Fourth International Conference on Architectural Support for Programming Languages and Operating Syst


This paper extends the reach of General Purpose GPU programming by presenting a software architecture that supports efficient fine-grained synchronization over global memory. The key idea is to transform global synchronization into global communication so that conflicts are serialized at the thread block level. With this structure, the threads within each thread block can synchronize using low latency, high-bandwidth local scratchpad memory. To enable this architecture, we implement a scalable and efficient message passing library. Using Nvidia GTX 1080 ti GPUs, we evaluate our new software architecture by using it to solve a set of five irregular problems on a variety of workloads. We find that on average, our solutions improve performance over carefully tuned state-of-the-art solutions by 3.6×. CCS Concepts • Computer systems organization → Single instruction, multiple data; • Software and its engineering → Mutual exclusion; Message passing.
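The key idea in the abstract above, turning synchronization into communication so that conflicting updates are serialized by a single owner, can be illustrated with a small sequential sketch. This is not the authors' message passing library: the function names, the `index % num_blocks` ownership rule, and the additive updates are all assumptions made for illustration.

```python
from collections import defaultdict


def route_updates(updates, num_blocks):
    """Group (index, delta) updates into one inbox per owning block."""
    inboxes = defaultdict(list)
    for index, delta in updates:
        owner = index % num_blocks  # hypothetical ownership rule
        inboxes[owner].append((index, delta))
    return inboxes


def apply_updates(data, updates, num_blocks):
    """Apply updates so each index is only ever modified by its owner.

    Because every conflicting update lands in the same inbox, the owner
    can process them one after another without any global lock.
    """
    result = list(data)
    inboxes = route_updates(updates, num_blocks)
    for owner in range(num_blocks):
        for index, delta in inboxes[owner]:
            result[index] += delta
    return result
```

For example, `apply_updates([0, 0, 0, 0], [(0, 1), (1, 2), (0, 3), (3, 4)], 2)` sends both updates to index 0 to block 0, which applies them in order. On a GPU the per-owner loop would run inside one thread block using scratchpad memory, which this sketch does not model.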


“…Li et al have provided a solution that mandates the programmer to responsibly handle this deadlock by preventing illegal accesses to locked locations in the main storage. This is achieved by using the lock bits appropriately as the lock unit is not configured to track ownership of locks.…”

Section: Deadlocks (mentioning)

confidence: 99%

A deadlock‐free lock‐based synchronization for GPUs

Anand

Srivastava

Shyamasundar

2018

Concurrency and Computation


Summary Graphics Processing Units (GPUs) have evolved from pure graphics applications toward general purpose applications, often referred to as GPGPU computing. However, its scope is still limited to data‐parallel applications that require little synchronization. As synchronization on GPUs is quite costly, synchronization requirements in GPUs are usually realized using existing synchronization primitives like atomic operations and barriers. These approaches either incur significant overhead or place certain restrictions in their usage, affecting the scalability/scope of such applications. The lack of adequate support for fine‐grained synchronization has restricted the realization of irregular algorithms on GPUs, wherein control flow and memory access patterns are data‐dependent and unpredictable. Recently, there has been an interest in building relationship between lock‐step semantics and interleaving semantics and to develop lock‐based synchronization mechanism for GPUs to overcome these issues. GPUs follow SIMD, and hence, when adapted for general purpose computing, new distinct deadlock scenarios arise. In this paper, we discuss various deadlock scenarios that can happen in GPUs, and present a modeling of deadlocks in GPUs. We shall first illustrate such deadlock scenarios in GPU applications, and then describe a novel lock‐based deadlock‐free, fine‐grained synchronization mechanism for GPU architectures that overcomes deadlocks without a significant overhead. We further establish the correctness of our methods and discuss the performance overheads.


“…Synchronization remains a performance bottleneck for many applications and has long been a classic problem in computer systems research [7,18,24,34,35]. To evaluate the synchronization cost in SpTRSV, we run a parallel SpTRSV implemented by Park et al [33] based on the aforementioned level-set approach.…”

Section: Motivation For Avoiding Synchronization (mentioning)

confidence: 99%

Fast synchronization‐free algorithms for parallel sparse triangular solves with multiple right‐hand sides

Liu

Hogg

et al. 2017

Concurrency and Computation

Self Cite


The sparse triangular solve kernels, SpTRSV and SpTRSM, are important building blocks for a number of numerical linear algebra routines. Parallelizing SpTRSV and SpTRSM on today's manycore platforms, such as GPUs, is not an easy task since computing a component of the solution may depend on previously computed components, enforcing a degree of sequential processing. As a consequence, most existing work introduces a preprocessing stage to partition the components into a group of level-sets or coloursets so that components within a set are independent and can be processed simultaneously during the subsequent solution stage. However, this class of methods requires a long preprocessing time as well as significant runtime synchronization overheads between the sets. To address this, we propose in this paper novel approaches for SpTRSV and SpTRSM in which the ordering between components is naturally enforced within the solution stage. In this way, the cost for preprocessing can be greatly reduced, and the synchronizations between sets are completely eliminated. To further exploit the data-parallelism, we also develop an adaptive scheme for efficiently processing multiple right-hand sides in SpTRSM. A comparison with a state-of-the-art library supplied by the GPU vendor, using 20 sparse matrices on the latest GPU device, shows that the proposed approach obtains an average speedup of over two for SpTRSV and up to an order of magnitude speedup for SpTRSM. In addition, our method is up to two orders of magnitude faster for the preprocessing stage than existing SpTRSV and SpTRSM methods.
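The level-set preprocessing that the abstract describes as the existing approach can be sketched as follows. This is a minimal illustration, not code from the paper: the function name and the input layout (`rows[i]` holds the column indices `j < i` of the off-diagonal nonzeros in row `i` of a lower-triangular matrix) are assumptions.

```python
def level_sets(rows):
    """Partition the unknowns of a lower-triangular solve into level sets.

    Component i depends on every component j listed in rows[i]. Its level
    is one more than the deepest level among those dependencies, so all
    components in the same level are independent of each other.
    """
    level = [0] * len(rows)
    for i, deps in enumerate(rows):
        # A row with no off-diagonal entries gets level 0.
        level[i] = 1 + max((level[j] for j in deps), default=-1)

    sets = [[] for _ in range(max(level, default=-1) + 1)]
    for i, lv in enumerate(level):
        sets[lv].append(i)
    return sets
```

For `rows = [[], [0], [], [1, 2]]` this gives three sets, `[0, 2]`, `[1]`, and `[3]`. A level-set solver processes each set in parallel and synchronizes between consecutive sets, here twice; that per-set synchronization is the overhead the abstract says the proposed approach eliminates.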
