Combining Software Cache Partitioning and Loop Tiling for Effective Shared Cache Management

Kelefouras, Vasilios; Keramidas, Georgios; Voros, Nikolaos

doi:10.1145/3202663

Cited by 5 publications

(5 citation statements)

References 49 publications


“…This method is applicable to all single-core and shared cache multi-core CPUs. In this section, we explain our method for single core CPUs which is applicable to shared cache CPUs too, by using the software shared cache partitioning method given in our previous work [21]; no more than p threads can run in parallel (one on each core), where p is the number of cores (single threaded codes only).…”

Section: Proposed Methodology (mentioning)

confidence: 99%


A methodology correlating code optimizations with data memory accesses, execution time and energy consumption

Kelefouras

Djemame

2019

J Supercomput

Self Cite


The advent of data proliferation and electronic devices puts software with low execution time and energy consumption in the spotlight. The key to optimizing software is the correct choice, order, and parameters of optimizations (transformations), which has remained an open problem in compilation research for decades for various reasons. First, most of the transformations are interdependent, and thus addressing them separately is not effective. Second, it is very hard to couple the transformation parameters to the processor architecture (e.g., cache size) and algorithm characteristics (e.g., data reuse); therefore compiler designers and researchers either do not take them into account at all or do so only partly. Third, the exploration space, i.e., the set of all optimization configurations that have to be explored, is huge, and thus searching is impractical. In this paper, the above problems are addressed for data-dominant affine loop kernels, delivering significant contributions. A novel methodology is presented that reduces the exploration space of six code optimizations by many orders of magnitude. The objective can be Execution Time (ET), Energy consumption (E), or the number of L1, L2, and main memory accesses. The exploration space is reduced in two phases: first, by applying a novel register blocking algorithm and a novel loop tiling algorithm, and second, by computing the maximum and minimum ET/E values for each optimization set. The proposed methodology has been evaluated on both embedded and general-purpose CPUs and for seven well-known algorithms, achieving high memory access, speedup, and energy consumption gains (from 1.17 up to 40) over the gcc compiler, hand-written optimized code, and Polly. The exploration space from which the near-optimum parameters are selected is reduced by 17 up to 30 orders of magnitude.
Keywords: code optimizations • data cache • register blocking • loop tiling • high performance • energy consumption • data reuse


Section: Proposed Methodology (mentioning)

confidence: 99%

“…No more than p threads run in parallel, one to each core, where p is the number of the cores. Different threads access only their assigned shared cache space and thus different thread tiles do not conflict with each other [21].…”

Section: Approximate the Number Of Memory Accesses And Arithmetical I (mentioning)

confidence: 99%



“…In [21], authors use an autotuning method to find the tile sizes, when the outermost loop is parallelised. In [11], loop tiling is combined with cache partitioning to improve performance in shared caches. Finally, in [22], a hybrid model is proposed by combining an analytical with an empirical model.…”

Section: Related Work (mentioning)

confidence: 99%

An Analytical Model for Loop Tiling Transformation

Kelefouras

Djemame

Keramidas

et al. 2022

Lecture Notes in Computer Science


Loop tiling is a well-known loop transformation that enhances data locality in the memory hierarchy. In this paper, we first reveal two important inefficiencies of current analytical loop tiling models and provide the theoretical background on how current analytical models can address these inefficiencies. To this end, we propose a new analytical model that is more accurate than the existing ones. We showcase, both theoretically and experimentally, that the proposed model can accurately estimate the number of cache misses for every generated tile size, and as a result more efficient tile sizes are selected. Our evaluation results show high cache-miss gains and significant performance gains over the gcc compiler and the Pluto tool on an x86 platform.
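For readers unfamiliar with the transformation named in this abstract, the following is a minimal sketch of loop tiling applied to square matrix multiplication. It is not the analytical model from the paper: the function names and the `tile` parameter are hypothetical, and the tile size is picked by hand rather than derived from cache parameters. Python is used only to show the loop structure; the locality benefit would be observed in a compiled language.

```python
def matmul_naive(a, b, n):
    """Reference triple loop: c = a * b for n x n matrices (lists of lists)."""
    c = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            for k in range(n):
                c[i][j] += a[i][k] * b[k][j]
    return c


def matmul_tiled(a, b, n, tile):
    """Same computation, with all three loops split into blocks of size `tile`.

    `tile` is an illustrative, hand-chosen parameter (an assumption of this
    sketch); an analytical model would derive it from cache size and data reuse.
    """
    c = [[0] * n for _ in range(n)]
    # Outer loops walk over tiles; min() handles a partial tile at the edge.
    for ii in range(0, n, tile):
        for kk in range(0, n, tile):
            for jj in range(0, n, tile):
                # Inner loops work on one tile, reusing a small block of
                # a, b and c before moving on to the next one.
                for i in range(ii, min(ii + tile, n)):
                    for k in range(kk, min(kk + tile, n)):
                        a_ik = a[i][k]
                        for j in range(jj, min(jj + tile, n)):
                            c[i][j] += a_ik * b[k][j]
    return c
```

Both functions compute the same product; tiling only changes the order in which iterations are visited, so that each block of data is reused while it is still likely to be in cache.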


“…In [4], authors present defensive tiling, a technique to minimize cache misses in inclusion shared caches, when multiple programs run simultaneously. In [17], loop tiling is combined with cache partitioning to improve performance in shared caches.…”

Section: Related Work (mentioning)

confidence: 99%