Automatic Throughput and Critical Path Analysis of x86 and ARM Assembly Kernels

Laukemann, Jan; Hammer, Julian; Hager, Georg; Wellein, Gerhard

doi:10.1109/pmbs49563.2019.00006

Cited by 16 publications

(9 citation statements)

References 12 publications


“…To create an accurate in-core model of the A64FX microarchitecture, we analyze different instruction forms, i.e., assembly instructions in combination with their operand types, based on the methodology introduced in [8,9]. Table 2 shows a list of instruction forms relevant for this work.…”

Section: In-core

mentioning

confidence: 99%


ECM modeling and performance tuning of SpMV and Lattice QCD on A64FX

Alappat, Meyer, Laukemann et al. 2021

Preprint

Self Cite


The A64FX CPU is arguably the most powerful Arm-based processor design to date. Although it is a traditional cache-based multicore processor, its peak performance and memory bandwidth rival accelerator devices. A good understanding of its performance features is of paramount importance for developers who wish to leverage its full potential. We present an architectural analysis of the A64FX used in the Fujitsu FX1000 supercomputer at a level of detail that allows for the construction of Execution-Cache-Memory (ECM) performance models for steady-state loops. In the process we identify architectural peculiarities that point to viable generic optimization strategies. After validating the model using simple streaming loops we apply the insight gained to sparse matrix-vector multiplication (SpMV) and the domain wall (DW) kernel from quantum chromodynamics (QCD). For SpMV we show why the CRS matrix storage format is not a good practical choice on this architecture and how the SELL-C-σ format can achieve bandwidth saturation. For the DW kernel we provide a cache-reuse analysis and show how an appropriate choice of data layout for complex arrays can realize memory-bandwidth saturation in this case as well. A comparison with state-of-the-art high-end Intel Cascade Lake AP and Nvidia V100 systems puts the capabilities of the A64FX into perspective. We also explore the potential for power optimizations using the tuning knobs provided by the Fugaku system, achieving energy savings of about 31% for SpMV and 18% for DW.



“…FIGURE 9 Effect of MVE on the performance of SpMV using GCC with plain C code, explicit unrolling with GCC and MVE, and using the plain C code with the FCC compiler.…”

mentioning

confidence: 99%

ECM modeling and performance tuning of SpMV and Lattice QCD on A64FX

Alappat, Meyer, Laukemann et al. 2021

Preprint

Self Cite


“…In case of CP analysis, the Open Source Architecture Code Analyzer (OSACA) by Laukemann et al (2018) is planned to become a versatile substitute for IACA, which does not provide CP prediction for modern Intel CPUs. OSACA has recently been extended to support CP and loop-carried dependency detection (see Laukemann et al, 2019). Data latency support would require a fundamental modification of the model, and work is ongoing in this direction.…”

Section: Future Work

mentioning

confidence: 99%

Analytic performance modeling and analysis of detailed neuron simulations

Cremonesi, Hager, Wellein et al. 2020

The International Journal of High Performance Computing Applica

Self Cite


Big science initiatives are trying to reconstruct and model the brain by attempting to simulate brain tissue at larger scales and with increasingly more biological detail than previously thought possible. The exponential growth of parallel computer performance has been supporting these developments, and at the same time maintainers of neuroscientific simulation code have strived to optimally and efficiently exploit new hardware features. Current state of the art software for the simulation of biological networks has so far been developed using performance engineering practices, but a thorough analysis and modeling of the computational and performance characteristics, especially in the case of morphologically detailed neuron simulations, is lacking. Other computational sciences have successfully used analytic performance engineering and modeling methods to gain insight on the computational properties of simulation kernels, aid developers in performance optimizations and eventually drive co-design efforts, but to our knowledge a model-based performance analysis of neuron simulations has not yet been conducted. We present a detailed study of the shared-memory performance of morphologically detailed neuron simulations based on the Execution-Cache-Memory (ECM) performance model. We demonstrate that this model can deliver accurate predictions of the runtime of almost all the kernels that constitute the neuron models under investigation. The gained insight is used to identify the main governing mechanisms underlying performance bottlenecks in the simulation. The implications of this analysis on the optimization of neural simulation software and eventually co-design of future hardware architectures are discussed. In this sense, our work represents a valuable conceptual and quantitative contribution to understanding the performance properties of biological networks simulations.
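The citation statement above refers to critical path (CP) and loop-carried dependency (LCD) detection. As a rough illustration of what those two quantities mean, here is a minimal Python sketch on a toy dot-product loop body. It is not OSACA's implementation or API: the `KERNEL` layout, the `chain_length` helper, and all latencies are hypothetical values chosen for illustration only.

```python
# Toy model of one loop iteration of `s += a[i] * b[i]`.
# Latencies are hypothetical cycle counts, not measurements of any real CPU.
KERNEL = [
    {"name": "load a[i]", "latency": 4, "deps": []},
    {"name": "load b[i]", "latency": 4, "deps": []},
    {"name": "mul",       "latency": 4, "deps": [0, 1]},
    {"name": "add s",     "latency": 3, "deps": [2]},
]

# (producer, consumer): the producer's result in iteration i is consumed
# by the consumer in iteration i + 1. Here the accumulator feeds itself.
LOOP_CARRIED = [(3, 3)]


def chain_length(kernel, end, start=None):
    """Longest latency sum over dependency chains that end at `end`.

    If `start` is given, only chains that begin at `start` are counted,
    and None is returned when `end` does not depend on `start`.
    """
    node = kernel[end]
    if end == start:
        return node["latency"]
    best = None
    for dep in node["deps"]:
        sub = chain_length(kernel, dep, start)
        if sub is not None and (best is None or sub > best):
            best = sub
    if best is None:
        # No usable predecessor: a chain root, or unreachable from `start`.
        return node["latency"] if start is None else None
    return best + node["latency"]


# Critical path: longest dependency chain inside a single iteration.
critical_path = max(chain_length(KERNEL, i) for i in range(len(KERNEL)))

# Loop-carried dependency: for each cross-iteration edge, the chain that
# runs from the consumer back to the producer within one iteration.
lcd_candidates = [
    chain_length(KERNEL, producer, start=consumer)
    for producer, consumer in LOOP_CARRIED
]
longest_lcd = max(c for c in lcd_candidates if c is not None)

print(f"critical path: {critical_path} cy, longest LCD: {longest_lcd} cy")
```

In this toy case the critical path covers a load, the multiply, and the add, whereas the loop-carried chain contains only the accumulating add, so the latter sets the lower bound on cycles per iteration when the loop is not unrolled.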


“…The model considers execution time contributions for steady-state loops from the core (assuming all data is in L1), data paths in the cache hierarchy, and the memory interface. For the core component, the loop's assembly code is analyzed for predictions of optimal throughput, critical path, and longest loop-carried dependency (the current development branch of the OSACA tool [5] has preliminary support for A64FX). Data transfer volumes through the memory hierarchy are obtained either by manual analysis or by the Kerncraft [6] tool; together with the known bandwidths of all data paths, time contributions for L1-L2 and L2-memory transfers are obtained.…”

Section: B. Brief Overview of the ECM Model

mentioning

confidence: 99%

Performance Modeling of Streaming Kernels and Sparse Matrix-Vector Multiplication on A64FX

Alappat, Laukemann, Gruber et al. 2020

Preprint

Self Cite


The A64FX CPU powers the current #1 supercomputer on the Top500 list. Although it is a traditional cache-based multicore processor, its peak performance and memory bandwidth rival accelerator devices. Generating efficient code for such a new architecture requires a good understanding of its performance features. Using these features, we construct the Execution-Cache-Memory (ECM) performance model for the A64FX processor in the FX700 supercomputer and validate it using streaming loops. We also identify architectural peculiarities and derive optimization hints. Applying the ECM model to sparse matrix-vector multiplication (SpMV), we motivate why the CRS matrix storage format is inappropriate and how the SELL-C-σ format with suitable code optimizations can achieve bandwidth saturation for SpMV.
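The citation statement for this paper describes how ECM time contributions are obtained from data transfer volumes and the bandwidths of the data paths. The arithmetic can be sketched in Python as below. Every number is a hypothetical placeholder rather than an A64FX value, and the names `transfer_cycles`, `VOLUMES`, and `BANDWIDTHS` are illustrative. The quoted text does not say how the contributions overlap, so the sketch only reports the two limiting cases (full overlap and no overlap) as a bracket, not as the model's actual prediction.

```python
def transfer_cycles(volume_bytes, bandwidth_bytes_per_cycle):
    """Time contribution of one data path: volume divided by bandwidth."""
    if bandwidth_bytes_per_cycle <= 0:
        raise ValueError("bandwidth must be positive")
    return volume_bytes / bandwidth_bytes_per_cycle


# Hypothetical inputs for one unit of work (e.g. a fixed number of iterations).
T_CORE = 8.0                                  # in-core cycles, data in L1
VOLUMES = {"L1-L2": 512, "L2-Mem": 512}       # bytes moved per unit of work
BANDWIDTHS = {"L1-L2": 64.0, "L2-Mem": 32.0}  # bytes per cycle per data path

contributions = {
    path: transfer_cycles(VOLUMES[path], BANDWIDTHS[path]) for path in VOLUMES
}

# Limiting cases only: the real overlap behavior is machine-specific.
full_overlap = max(T_CORE, *contributions.values())
no_overlap = T_CORE + sum(contributions.values())

for path, cycles in contributions.items():
    print(f"{path}: {cycles:.1f} cy")
print(f"bracket: {full_overlap:.1f} to {no_overlap:.1f} cy per unit of work")
```

A prediction that falls between the two limits would then depend on which contributions the specific machine can overlap, which is the part of the model that must be established per architecture.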

