2019 Design, Automation & Test in Europe Conference & Exhibition (DATE)
DOI: 10.23919/date.2019.8714861
Cache-Aware Kernel Tiling: An Approach for System-Level Performance Optimization of GPU-Based Applications

Cited by 4 publications (6 citation statements). References 10 publications.
“…Only for some architectures, and for Stabilization, was the miss rate of the LLC acceptable ((17), (18), (19)). The miss rates of L2 and L3 in these applications defeat the purpose of using caches; in fact, they decrease application performance (for instance, from (12) to (13)). Besides, these miss rates are far from the expected cache behavior (i.e., ≤ 10%) [14].…”
Section: Results (mentioning, confidence: 99%)
“…In [13], a method is proposed for GPU-based applications that splits both the GPU kernel into sub-kernels and the input data into tiles sized to fit the GPU's L2 cache. Their work is intended to accelerate applications whose performance is bound by memory latency.…”
Section: Related Work (mentioning, confidence: 99%)
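To make the tiling idea summarized in this statement concrete, here is a minimal CUDA sketch, not the cited authors' actual implementation: the host queries the device's L2 capacity and launches a sub-kernel once per L2-sized tile, so each launch's working set can remain cache-resident. The toy kernel and all names are illustrative assumptions.

#include <algorithm>
#include <cstdio>
#include <cuda_runtime.h>

// Illustrative memory-latency-bound sub-kernel operating on one tile.
__global__ void subKernel(const float* in, float* out, int tileLen) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < tileLen)
        out[i] = in[i] * 2.0f;  // placeholder for real per-element work
}

int main() {
    const int n = 1 << 24;  // total number of elements (assumed workload)
    float *in = nullptr, *out = nullptr;
    cudaMallocManaged(&in, n * sizeof(float));
    cudaMallocManaged(&out, n * sizeof(float));
    for (int i = 0; i < n; ++i) in[i] = static_cast<float>(i);

    // Query the device's L2 size and pick a tile that fits it,
    // budgeting one input float and one output float per element.
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    int tileElems = prop.l2CacheSize / (2 * static_cast<int>(sizeof(float)));

    // Launch the sub-kernel once per tile instead of once over all data,
    // so each launch's working set can stay resident in L2.
    for (int off = 0; off < n; off += tileElems) {
        int len = std::min(tileElems, n - off);
        int threads = 256;
        int blocks = (len + threads - 1) / threads;
        subKernel<<<blocks, threads>>>(in + off, out + off, len);
    }
    cudaDeviceSynchronize();

    printf("out[42] = %f\n", out[42]);
    cudaFree(in);
    cudaFree(out);
    return 0;
}

Serializing launches this way trades launch overhead for locality; as the statement notes, it only pays off when the kernel's performance is bound by memory latency.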
“…For an application with ample scope for concurrency, we have observed that implementing fine-grained scheduling policies in PySchedCL, where the user specifies an intuitive task-component partitioning T after examining the structure of a DAG application, results in significantly better execution times than relying on traditional coarse-grained scheduling decisions. Future work entails investigating sophisticated low-level scheduling approaches, such as sub-kernel partitioning [9], [25] at the work-item level, for effective interleaving of concurrent kernels. Such approaches, coupled with machine-learning-assisted control-theoretic scheduling solutions [26], shall be used to develop an auto-tuning framework on top of PySchedCL that automatically determines, for a given application-architecture pair, the optimal allocation of command queues across the devices in the platform.…”
Section: Discussion (mentioning, confidence: 99%)
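The "interleaving of concurrent kernels" this statement anticipates can be sketched in CUDA with streams. PySchedCL itself targets OpenCL, so this stream-based variant is an illustrative assumption, not its API: independent sub-kernels launched in distinct streams may overlap on the device when resources allow.

#include <cstdio>
#include <cuda_runtime.h>

// Two illustrative independent sub-kernels (stand-ins for partitions of a task).
__global__ void subKernelA(float* d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] += 1.0f;
}
__global__ void subKernelB(float* d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= 2.0f;
}

int main() {
    const int n = 1 << 20;
    float *a = nullptr, *b = nullptr;
    cudaMalloc(&a, n * sizeof(float));
    cudaMalloc(&b, n * sizeof(float));
    cudaMemset(a, 0, n * sizeof(float));
    cudaMemset(b, 0, n * sizeof(float));

    // Launching into separate streams lets the runtime interleave the
    // two sub-kernels on the device instead of serializing them.
    cudaStream_t s0, s1;
    cudaStreamCreate(&s0);
    cudaStreamCreate(&s1);

    int threads = 256, blocks = (n + threads - 1) / threads;
    subKernelA<<<blocks, threads, 0, s0>>>(a, n);
    subKernelB<<<blocks, threads, 0, s1>>>(b, n);

    cudaStreamSynchronize(s0);
    cudaStreamSynchronize(s1);
    printf("done\n");

    cudaStreamDestroy(s0);
    cudaStreamDestroy(s1);
    cudaFree(a);
    cudaFree(b);
    return 0;
}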
“…However, this type of optimization does not address the coarse-grained inter-actor (i.e., inter-task) relations. In Maghazeh et al. [17], a method is proposed for GPU-based applications that splits both the GPU kernel into sub-kernels and the input data into tiles sized to fit the GPU's L2 cache. Their work is intended to accelerate applications whose performance is bound by memory latency.…”
Section: Related Work (mentioning, confidence: 99%)
“…Regarding contribution (ii), it fills gaps left by the related proposals [2, 6, 17, 22]. Specifically, we are interested in: (i) keeping the original dataflow modeling granularity (differently from [6, 17]); (ii) not modifying the Linux kernel, or any other part of the OS (contrary to [2]); and (iii) targeting generic SMP platforms (differently from [22]).…”
Section: Related Work (mentioning, confidence: 99%)