Performance Analysis with Cache-Aware Roofline Model in Intel Advisor

Marques, Diogo; Duarte, H.; Ilić, Aleksandar; Sousa, Leonel; Belenov, Roman; Thierry, Philippe; Matveev, Zakhar A.

doi:10.1109/hpcs.2017.150

Cited by 22 publications

(12 citation statements)

References 12 publications

“…Our contribution to the CARM can also advance its current implementation in the Intel proprietary tool's, referred as Intel Advisor Roofline [21], and for which some of the authors of this paper published concrete use cases [22]. Unlike Intel Advisor Roofline, we keep track of the MCDRAM bandwidth in several aspects, and provide additional insights about potential bottlenecks and characteristics of NUMA systems.…”

Section: Related Work (mentioning)

confidence: 99%

Modeling Non-Uniform Memory Access on Large Compute Nodes with the Cache-Aware Roofline Model

Denoyelle, Goglin, Ilić et al. 2019. IEEE Trans. Parallel Distrib. Syst.

Self Cite

NUMA platforms and emerging memory architectures with on-package high-bandwidth memories bring new opportunities and challenges to bridge the gap between computing power and memory performance. Heterogeneous memory machines feature several performance trade-offs, depending on the kind of memory used and on whether it is read or written. Finding memory performance upper bounds subject to such trade-offs aligns with the broader interest in measuring computing system performance. In particular, representing application performance with respect to the platform's performance bounds has been addressed in the state-of-the-art Cache-Aware Roofline Model (CARM) to troubleshoot performance issues. In this paper, we present a Locality-Aware extension (LARM) of the CARM to model NUMA platform bottlenecks, such as contention and remote access. On top of this, the new contribution of this paper is the design and validation of a novel hybrid memory bandwidth model. This new hybrid model quantifies the achievable bandwidth upper bound under the trade-offs described above with less than 3% error. Hence, when comparing application performance with the maximum attainable performance, software designers can now rely on more accurate information.

“…The second one comes with two Intel Xeon 2697v4 processors based on the previous to study the performance and the limitations of our implementations. However, in [5], [3], [4], authors show some limitations of the ORM when the model is used to drive the optimization process. These works propose other roofline models, such as the Cache Aware Roofline Model (CARM) and Locality Aware Roofline Model (LARM), which take into account more architectural details.…”

Section: Experimental Setup a Experimental Context (mentioning)

confidence: 99%

“…Efforts to prepare applications for this upcoming system should rely on a deep understanding of the algorithms to predict the performance. In fact, many works, like [1], [2], [3], [4], [5], deal with the efficiency concern from the interrelation between hardware and algorithms. Finite-element methods are representative of such situation, as these numerical approaches are at the heart of many open-source or commercial software packages [6], [7], [8].…”

Section: Introduction (mentioning)

confidence: 99%

Performance Analysis of SIMD Vectorization of High-Order Finite-Element Kernels

Sornet, Jubertie, Dupros et al. 2018. 2018 International Conference on High Performance Computing & Simulation (HPCS)

Physics-based three-dimensional numerical simulations are becoming more predictive and are already essential for improving the understanding of natural phenomena, such as earthquakes, tsunamis, flooding or climate change and global warming. Among the numerical methods available to support these simulations, Finite-Element formulations have been implemented in several major software packages. The efficiency of these algorithms remains a challenge due to irregular memory accesses that prevent them from reaching the maximum performance of current architectures. This is particularly true at the shared-memory level with several levels of parallelism and complex memory hierarchies. Despite significant efforts, automatic optimizations provided by compilers and high-level frameworks are often far from the performance obtained with hand-tuned implementations. In this paper, we have extracted a kernel from the EFISPEC software package developed at BRGM (the French Geological Survey). This application implements a high-order finite-element method to solve the elastodynamic equation. We characterize the performance of the extracted mini-app considering key parameters such as the order of the approximation, the memory access pattern or the vector length. Based on this study, we detail specific optimizations and we discuss the measured results with respect to the roofline performance model on Intel Broadwell and Skylake architectures.

“…Our contribution to the CARM also advances its current implementation in the Intel proprietary tool's, referred as Intel Advisor Roofline [22], and for which some author of this paper published concrete cases usage [23]. Unlike Intel Advisor Roofline, we keep track of the MCDRAM bandwidth in several aspects, and provide additional insights about potential bottlenecks and characteristics of NUMA systems.…”

Section: Related Work (mentioning)

confidence: 99%

Modeling Large Compute Nodes with Heterogeneous Memories with Cache-Aware Roofline Model

Denoyelle, Goglin, Ilić et al. 2017. Lecture Notes in Computer Science

Self Cite

In order to fulfill modern applications' needs, computing systems become more powerful, heterogeneous and complex. NUMA platforms and emerging high bandwidth memories offer new opportunities for performance improvements. However, they also increase hardware and software complexity, thus making application performance analysis and optimization an even harder task. The Cache-Aware Roofline Model (CARM) is an insightful, yet simple model designed to address this issue. It provides feedback on potential application bottlenecks and shows how far the application performance is from the achievable hardware upper bounds. However, it does not encompass NUMA systems and next-generation processors with heterogeneous memories. Yet, some application bottlenecks belong to those memory subsystems, and would benefit from the CARM insights. In this paper, we fill the missing requirements to cover recent large shared memory systems with the CARM. We provide the methodology to instantiate and validate the model on a NUMA system as well as on the latest Xeon Phi processor equipped with configurable hybrid memory. Finally, we show the model's ability to exhibit several bottlenecks of such systems, which were not supported by CARM.
