Improving GPGPU resource utilization through alternative thread block scheduling

Lee, Minseok; Song, Seokwoo; Moon, Joosik; Kim, John; Seo, Woong; Cho, Yeongon; Ryu, Soojung

doi:10.1109/hpca.2014.6835937

Cited by 153 publications

(65 citation statements)

References 18 publications

Supporting

Mentioning

Contrasting

Order By: Relevance

“…Eventually, the threads' working sets overflow the L2 cache, degrading performance. This effect has been shown in simulation, but simple analytic models do not take it into account [29], [34]. Numerous other non-obvious scaling results exist, but we do not detail them due to space constraints.…”

Section: High-level Gpgpu Modelmentioning

confidence: 93%

GPGPU performance and power estimation using machine learning

Greathouse

Lyashevsky

et al. 2015

2015 IEEE 21st International Symposium on High Performance Computer Architecture (HPCA)

173

View full text Add to dashboard Cite

Graphics Processing Units (GPUs) have numerous configuration and design options, including core frequency, number of parallel compute units (CUs), and available memory bandwidth. At many stages of the design process, it is important to estimate how application performance and power are impacted by these options. This paper describes a GPU performance and power estimation model that uses machine learning techniques on measurements from real GPU hardware. The model is trained on a collection of applications that are run at numerous different hardware configurations. From the measured performance and power data, the model learns how applications scale as the GPU's configuration is changed. Hardware performance counter values are then gathered when running a new application on a single GPU configuration. These dynamic counter values are fed into a neural network that predicts which scaling curve from the training data best represents this kernel. This scaling curve is then used to estimate the performance and power of the new application at different GPU configurations.Over an 8× range of the number of CUs, a 3.3× range of core frequencies, and a 2.9× range of memory bandwidth, our model's performance and power estimates are accurate to within 15% and 10% of real hardware, respectively. This is comparable to the accuracy of cycle-level simulators. However, after an initial training phase, our model runs as fast as, or faster than the program running natively on real hardware.

show abstract

Section: High-level Gpgpu Modelmentioning

confidence: 93%

GPGPU performance and power estimation using machine learning

Greathouse

Lyashevsky

et al. 2015

2015 IEEE 21st International Symposium on High Performance Computer Architecture (HPCA)

173

View full text Add to dashboard Cite

show abstract

“…Prior research work [5,8], however, has shown that executing the maximum possible number of thread blocks on a streaming multiprocessor is not always the optimal choice from the performance perspective due to inefficient utilization of streaming multiprocessor resources. Indeed, when the thread block scheduler always assigns the maximum thread blocks to a streaming multiprocessor, it might cause a higher number of memory and interconnection network stalls.…”

Section: Proposed Warp Schedulermentioning

confidence: 99%

“…Previous works [5,8] pointed out the drawback of the existing thread block schedulers that maximizing the number of thread blocks assigned to a streaming multiprocessor is not always effective -i.e. increasing the number of thread blocks does not necessarily improve performance.…”

Section: Related Workmentioning

confidence: 99%

A Novel Cooperative Warp and Thread Block Scheduling Technique for Improving the GPGPU Resource Utilization

Thuan¹,

Choi²,

Kim³

et al. 2017

KIPS Transactions on Computer and Communication Systems

View full text Add to dashboard Cite

General-Purpose Graphics Processing Units (GPGPUs) build massively parallel architecture and apply multithreading technology to explore parallelism. By using programming models like CUDA, and OpenCL, GPGPUs are becoming the best in exploiting plentiful thread-level parallelism caused by parallel applications. Unfortunately, modern GPGPU cannot efficiently utilize its available hardware resources for numerous general-purpose applications. One of the primary reasons is the inefficiency of existing warp/thread block schedulers in hiding long latency instructions, resulting in lost opportunity to improve the performance. This paper studies the effects of hardware thread scheduling policy on GPGPU performance. We propose a novel warp scheduling policy that can alleviate the drawbacks of the traditional round-robin policy. The proposed warp scheduler first classifies the warps of a thread block into two groups, warps with long latency and warps with short latency and then schedules the warps with long latency before the warps with short latency. Furthermore, to support the proposed warp scheduler, we also propose a supplemental technique that can dynamically reduce the number of streaming multiprocessors to which will be assigned thread blocks when encountering a high contention degree at the memory and interconnection network. Based on our experiments on a 15-streaming multiprocessor GPGPU platform, the proposed warp scheduling policy provides an average IPC improvement of 7.5% over the baseline round-robin warp scheduling policy. This paper also shows that the GPGPU performance can be improved by approximately 8.9% on average when the two proposed techniques are combined.

show abstract

“…Besides replacement policy, thread scheduling is considered as another critical approach for improving the cache usage efficiency [13,14,15,16]. Through the rearrangement of thread execution order, the re-reference interval of the data of each thread can be dynamically reconfigured, and the cache quota can also adapt to the needs of the running threads.…”

Section: Introductionmentioning

confidence: 99%

VWS

Mao

Chen

et al. 2015

Proceedings of the 52nd Annual Design Automation Conference

View full text Add to dashboard Cite

Massive multi-threading of GPGPU demands for efficient usage of caches with limited capacity. In this work, we propose a versatile warp scheduler (VWS) to reduce the cache miss rate in GPGPU. VWS retains the intra-warp cache locality using an efficient per-warp working set estimator and enhances intra-/inter-cooperative thread array (CTA) cache locality through imposing a CTA-aware scheduling policy and a new CTA dispatching mechanism. The significantly improved hit rate of cache hierarchy enables VWS to achieve on average 38.4% and 9.3% IPC improvement across diverse GPGPU applications compared to a widely-used and a state-of-the-art warp schedulers, respectively.

show abstract

Improving GPGPU resource utilization through alternative thread block scheduling

Cited by 153 publications

References 18 publications

GPGPU performance and power estimation using machine learning

GPGPU performance and power estimation using machine learning

A Novel Cooperative Warp and Thread Block Scheduling Technique for Improving the GPGPU Resource Utilization

VWS

Contact Info

Product

Resources

About