2019 Design, Automation & Test in Europe Conference & Exhibition (DATE)
DOI: 10.23919/date.2019.8714861
Cache-Aware Kernel Tiling: An Approach for System-Level Performance Optimization of GPU-Based Applications

Cited by 4 publications (6 citation statements). References 10 publications.
“…Only for some architectures, and for Stabilization, was the miss rate of the LLC acceptable ((17), (18), (19)). The miss rates of L2 and L3 in these applications defeat the purpose of using caches; in fact, they decrease application performance (for instance, from (12) to (13)). Besides, these miss rates are far from the expected cache behavior (i.e., ≤ 10%) [14].…”
Section: Results (mentioning, confidence: 99%)
“…In [13], a method is proposed for GPU-based applications that splits both the GPU kernel into sub-kernels and the input data into tiles sized to fit the GPU's L2 cache. Their work is intended to accelerate applications whose performance is bound by memory latency.…”
Section: Related Work (mentioning, confidence: 99%)
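To make the tiling idea summarized in this statement concrete, here is a minimal CUDA sketch, not the cited authors' actual implementation: the host queries the device's L2 capacity and launches a sub-kernel once per L2-sized tile, so each launch's working set can remain cache-resident. The toy kernel and all names are illustrative assumptions.

#include <algorithm>
#include <cstdio>
#include <cuda_runtime.h>

// Illustrative memory-latency-bound sub-kernel operating on one tile.
__global__ void subKernel(const float* in, float* out, int tileLen) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < tileLen)
        out[i] = in[i] * 2.0f;  // placeholder for real per-element work
}

int main() {
    const int n = 1 << 24;  // total number of elements (assumed workload)
    float *in = nullptr, *out = nullptr;
    cudaMallocManaged(&in, n * sizeof(float));
    cudaMallocManaged(&out, n * sizeof(float));
    for (int i = 0; i < n; ++i) in[i] = static_cast<float>(i);

    // Query the device's L2 size and pick a tile that fits it,
    // budgeting one input float and one output float per element.
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    int tileElems = prop.l2CacheSize / (2 * static_cast<int>(sizeof(float)));

    // Launch the sub-kernel once per tile instead of once over all data,
    // so each launch's working set can stay resident in L2.
    for (int off = 0; off < n; off += tileElems) {
        int len = std::min(tileElems, n - off);
        int threads = 256;
        int blocks = (len + threads - 1) / threads;
        subKernel<<<blocks, threads>>>(in + off, out + off, len);
    }
    cudaDeviceSynchronize();

    printf("out[42] = %f\n", out[42]);
    cudaFree(in);
    cudaFree(out);
    return 0;
}

Serializing launches this way trades launch overhead for locality; as the statement notes, it only pays off when the kernel's performance is bound by memory latency.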
“…For an application with ample scope for concurrency, we have observed that implementing fine-grained scheduling policies in PySchedCL, where the user specifies an intuitive task-component partitioning T after examining the structure of a DAG application, results in significantly better execution times than relying on traditional coarse-grained scheduling decisions. Future work entails investigating sophisticated low-level scheduling approaches, such as sub-kernel partitioning [9], [25] at the work-item level, for effective interleaving of concurrent kernels. Such approaches, coupled with machine-learning-assisted control-theoretic scheduling solutions [26], shall be used to develop an auto-tuning framework on top of PySchedCL that automatically determines, for a given application-architecture pair, the optimal allocation of command queues across the devices in the platform.…”
Section: Discussion (mentioning, confidence: 99%)
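The "interleaving of concurrent kernels" this statement anticipates can be sketched in CUDA with streams. PySchedCL itself targets OpenCL, so this stream-based variant is an illustrative assumption, not its API: independent sub-kernels launched in distinct streams may overlap on the device when resources allow.

#include <cstdio>
#include <cuda_runtime.h>

// Two illustrative independent sub-kernels (stand-ins for partitions of a task).
__global__ void subKernelA(float* d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] += 1.0f;
}
__global__ void subKernelB(float* d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= 2.0f;
}

int main() {
    const int n = 1 << 20;
    float *a = nullptr, *b = nullptr;
    cudaMalloc(&a, n * sizeof(float));
    cudaMalloc(&b, n * sizeof(float));
    cudaMemset(a, 0, n * sizeof(float));
    cudaMemset(b, 0, n * sizeof(float));

    // Launching into separate streams lets the runtime interleave the
    // two sub-kernels on the device instead of serializing them.
    cudaStream_t s0, s1;
    cudaStreamCreate(&s0);
    cudaStreamCreate(&s1);

    int threads = 256, blocks = (n + threads - 1) / threads;
    subKernelA<<<blocks, threads, 0, s0>>>(a, n);
    subKernelB<<<blocks, threads, 0, s1>>>(b, n);

    cudaStreamSynchronize(s0);
    cudaStreamSynchronize(s1);
    printf("done\n");

    cudaStreamDestroy(s0);
    cudaStreamDestroy(s1);
    cudaFree(a);
    cudaFree(b);
    return 0;
}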
“…However, this type of optimization does not address the coarse-grained inter-actor (i.e., inter-task) relations. In Maghazeh et al. [17], a method is proposed for GPU-based applications that splits both the GPU kernel into sub-kernels and the input data into tiles sized to fit the GPU's L2 cache. Their work is intended to accelerate applications whose performance is bound by memory latency.…”
Section: Related Work (mentioning, confidence: 99%)
“…Regarding contribution (ii), it fills gaps left by the related proposals [2, 6, 17, 22]. Specifically, we are interested in: (i) keeping the original dataflow modeling granularity (differently from [6, 17]); (ii) not modifying the Linux kernel, or any other part of the OS (contrary to [2]); and (iii) targeting generic SMP platforms (differently from [22]).…”
Section: Related Work (mentioning, confidence: 99%)