Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture 2011
DOI: 10.1145/2155620.2155655

Hardware transactional memory for GPU architectures

Abstract: Graphics Processing Units (GPUs) have become the accelerator of choice for data-parallel applications, enabling the execution of thousands of threads in a Single-Instruction Multiple-Thread (SIMT) fashion. Using OpenCL terminology, GPUs offer a global memory space shared by all the threads in the GPU, as well as a low-latency local memory space shared by a subset of the threads. The latter is used as a scratchpad to improve the performance of applications. We propose GPU-LocalTM, a hardware transactional memory…
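
The abstract's distinction between global memory and the low-latency local-memory scratchpad is easiest to see in code. Below is a minimal CUDA sketch (CUDA's __shared__ corresponds to OpenCL "local memory") of a per-block histogram: the bins live in the scratchpad, and the atomicAdd updates are the kind of fine-grained synchronization a local-memory transactional scheme would aim to simplify. The kernel and its parameters are illustrative, not taken from the paper.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define BINS 64

__global__ void histogram(const unsigned char *data, int n, unsigned int *out) {
    __shared__ unsigned int bins[BINS];          // scratchpad shared by the block
    for (int i = threadIdx.x; i < BINS; i += blockDim.x)
        bins[i] = 0;
    __syncthreads();

    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x)
        atomicAdd(&bins[data[i] % BINS], 1u);    // contended update in local memory
    __syncthreads();

    for (int i = threadIdx.x; i < BINS; i += blockDim.x)
        atomicAdd(&out[i], bins[i]);             // merge block result into global memory
}
```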

Cited by 87 publications (35 citation statements)
References 55 publications

Citation statements:

“…Scheduling: Lindholm et al. [30] suggest that the warp scheduler used in NVIDIA GPUs has zero-cycle overhead, and warps can be scheduled according to their pre-determined priorities. Since the difference between PA and TL schedulers is primarily in the fetch group formation approach, the hardware overhead of our proposal is similar to that of the TL scheduler.…”
[Interleaved table residue, DRAM timing parameters from [10]: tCL = 10, tRP = 10, tRC = 35, tRAS = 25, tRCD = 12, tRRD = 8, tCDLR = 6, tWR = 11]
Section: Hardware Overhead
Mentioning, confidence: 99%
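
The recovered table fragment and the scheduling claim translate naturally into code. The sketch below (host-side CUDA/C++) packs the DRAM timings quoted from [10] into a config struct and shows a toy pick-highest-priority-ready-warp loop. It illustrates "schedule warps by pre-determined priority" only; it is not the PA or TL scheduler from the citing paper, and the Warp fields are assumptions.

```cuda
#include <cstdint>

struct DramTiming {                 // values in memory-clock cycles, from [10]
    int tCL = 10, tRP = 10, tRC = 35, tRAS = 25;
    int tRCD = 12, tRRD = 8, tCDLR = 6, tWR = 11;
};

struct Warp { int id; int priority; bool ready; };   // illustrative fields

// Return the index of the ready warp with the highest priority, or -1 if none.
int pickWarp(const Warp *warps, int n) {
    int best = -1;
    for (int i = 0; i < n; ++i)
        if (warps[i].ready && (best < 0 || warps[i].priority > warps[best].priority))
            best = i;
    return best;
}
```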
“…Storing copies of a few registers for transactional threads on a CPU core is relatively cheap. For GPUs, however, with thousands of threads running, naively checkpointing large register files would incur significant overhead [Fung et al. 2011]. Therefore, it is not practical to use traditional CPU checkpointing mechanisms on the GPU.…”
Section: Paragon Overview
Mentioning, confidence: 99%
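
To see why naive register checkpointing is expensive on a GPU, a back-of-the-envelope calculation helps. The numbers below (resident threads per SM, live registers per thread, SM count) are illustrative assumptions, roughly Fermi-era, not figures from the cited papers.

```cuda
#include <cstdio>

int main() {
    const long threadsPerSM  = 1536;   // resident threads per SM (assumption)
    const long regsPerThread = 21;     // average live registers (assumption)
    const long bytesPerReg   = 4;
    const long numSMs        = 16;     // SMs on the chip (assumption)

    long perSM = threadsPerSM * regsPerThread * bytesPerReg;   // ~126 KiB per SM
    printf("checkpoint per SM: %ld KiB, chip-wide: %ld KiB\n",
           perSM / 1024, perSM * numSMs / 1024);
    return 0;
}
```

Even with these modest assumptions, a full checkpoint approaches the size of the register file itself on every SM, which is the overhead the quoted passage is pointing at.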
“…Recent works [Cederman et al. 2010; Fung et al. 2011] proposed software and hardware transactional memory systems for graphics engines. In these works, each thread is a transaction, and if a transaction aborts, it needs to re-execute.…”
Section: Related Work
Mentioning, confidence: 99%
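
The abort-and-re-execute model the quote describes can be sketched with a retry loop. Below, a CUDA spin lock built from atomicCAS stands in for hardware begin/commit: failing to acquire the lock plays the role of an abort, and the loop is the re-execution. This illustrates the programming model only, not either cited design; the lock variable and kernel are hypothetical.

```cuda
__device__ int txLock = 0;

__global__ void txUpdate(int *shared_counter) {
    bool done = false;
    while (!done) {                            // "re-execute until commit"
        if (atomicCAS(&txLock, 0, 1) == 0) {   // try to enter the transaction
            *shared_counter += 1;              // transactional body
            __threadfence();                   // publish writes before releasing
            atomicExch(&txLock, 0);            // "commit"
            done = true;
        }                                      // else: "abort", loop retries
    }
}
```

The if-inside-while shape (rather than spinning directly on atomicCAS) is the usual way to keep a SIMT warp from deadlocking on a lock held by one of its own lanes.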
“…In addition to traditional CPU architectures, massively parallel processing (MPP) architectures such as GPUs play an important cooperating role in performing highly intensive computation. When dynamic memory allocation is ported from the CPU to a massively parallel environment, it can suffer performance problems, such as memory-transaction latency or thread synchronization, that become a bottleneck and reduce total computing power [1]. Therefore, a suitable memory-management scheme is needed to handle dynamic memory allocation on MPP architectures.…”
Section: Introduction
Mentioning, confidence: 99%
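
A concrete instance of the allocation pattern the quote discusses: CUDA supports device-side malloc/free (compute capability 2.0 and later), and many threads calling the allocator concurrently is exactly where the latency and synchronization costs show up. The kernel, launch configuration, and heap size below are illustrative.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void perThreadAlloc(int n) {
    // Every thread allocates its own buffer; concurrent calls contend
    // on the device heap, the bottleneck the quoted passage describes.
    int *buf = (int *)malloc(n * sizeof(int));
    if (buf == nullptr) return;                // device heap exhausted
    for (int i = 0; i < n; ++i) buf[i] = threadIdx.x;
    free(buf);
}

int main() {
    // The device heap must be sized before launch (the default is small).
    cudaDeviceSetLimit(cudaLimitMallocHeapSize, 64 << 20);   // 64 MiB
    perThreadAlloc<<<64, 256>>>(32);
    cudaDeviceSynchronize();
    return 0;
}
```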