2019 IEEE/ACM Performance Modeling, Benchmarking and Simulation of High Performance Computer Systems (PMBS) 2019
DOI: 10.1109/pmbs49563.2019.00006

Automatic Throughput and Critical Path Analysis of x86 and ARM Assembly Kernels

Abstract: Useful models of loop kernel runtimes on out-of-order architectures require an analysis of the in-core performance behavior of instructions and their dependencies. While an instruction throughput prediction sets a lower bound to the kernel runtime, the critical path defines an upper bound. Such predictions are an essential part of analytic (i.e., white-box) performance models like the Roofline and Execution-Cache-Memory (ECM) models. They enable a better understanding of the performance-relevant interactions bet…
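The two bounds named in the abstract can be illustrated with a minimal sketch. It assumes a toy kernel whose instructions, latencies, port names, and port pressures are all invented for illustration (they are not measured values and not taken from the paper), and it assumes a fixed assignment of work to ports. The throughput bound is the load on the busiest port; the critical path is the longest latency chain through the dependency graph of one iteration.

```python
from collections import defaultdict

# Hypothetical kernel, listed in program order.
# Each entry: (name, latency in cycles, {port: cycles of pressure}, dependencies).
# All numbers and port names are made up for illustration only.
KERNEL = [
    ("load_a", 4, {"P2": 0.5, "P3": 0.5}, []),
    ("load_b", 4, {"P2": 0.5, "P3": 0.5}, []),
    ("mul",    4, {"P0": 1.0},            ["load_a", "load_b"]),
    ("add",    4, {"P0": 0.5, "P1": 0.5}, ["mul"]),
    ("store",  1, {"P4": 1.0},            ["add"]),
]


def throughput_bound(kernel):
    """Cycles per iteration if only port pressure limits execution."""
    pressure = defaultdict(float)
    for _, _, ports, _ in kernel:
        for port, cycles in ports.items():
            pressure[port] += cycles
    # The busiest port determines the lower bound on runtime.
    return max(pressure.values())


def critical_path(kernel):
    """Longest latency chain through the dependency graph of one iteration."""
    finish = {}
    # Program order guarantees that dependencies are processed first.
    for name, latency, _, deps in kernel:
        start = max((finish[d] for d in deps), default=0)
        finish[name] = start + latency
    return max(finish.values())


tp = throughput_bound(KERNEL)
cp = critical_path(KERNEL)
print(f"throughput bound: {tp} cy/it, critical path: {cp} cy/it")
```

In this toy case port P0 carries the most work, and the chain load, mul, add, store forms the critical path. Loop-carried dependencies, which span iterations, are not modeled in this sketch.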

Cited by 16 publications (9 citation statements)
References 12 publications

“…To create an accurate in-core model of the A64FX microarchitecture, we analyze different instruction forms, i.e., assembly instructions in combination with their operand types, based on the methodology introduced in [8,9]. Table 2 shows a list of instruction forms relevant for this work.…”
Section: In-core
mentioning
confidence: 99%
“…FIGURE 9 Effect of MVE on the performance of SpMV using GCC with plain C code, explicit unrolling with GCC and MVE, and using the plain C code with the FCC compiler.…”
mentioning
confidence: 99%
“…In case of CP analysis, the Open Source Architecture Code Analyzer (OSACA) by Laukemann et al (2018) is planned to become a versatile substitute for IACA, which does not provide CP prediction for modern Intel CPUs. OSACA has recently been extended to support CP and loop-carried dependency detection (see Laukemann et al, 2019). Data latency support would require a fundamental modification of the model, and work is ongoing in this direction.…”
Section: Future Work
mentioning
confidence: 99%
“…The model considers execution time contributions for steady-state loops from the core (assuming all data is in L1), data paths in the cache hierarchy, and the memory interface. For the core component, the loop's assembly code is analyzed for predictions of optimal throughput, critical path, and longest loop-carried dependency (the current development branch of the OSACA tool [5] has preliminary support for A64FX). Data transfer volumes through the memory hierarchy are obtained either by manual analysis or by the Kerncraft [6] tool; together with the known bandwidths of all data paths, time contributions for L1-L2 and L2-memory transfers are obtained.…”
Section: B Brief Overview of the ECM Model
mentioning
confidence: 99%
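
The last statement describes obtaining time contributions from data volumes and known bandwidths. A minimal sketch of that arithmetic follows, assuming each contribution is simply volume divided by bandwidth. The volumes, bandwidths, and the helper name `transfer_cycles` are hypothetical and not taken from the cited work or from Kerncraft. How the contributions are combined with the in-core time into one prediction depends on the machine model and is deliberately left out.

```python
def transfer_cycles(volume_bytes, bandwidth_bytes_per_cycle):
    """Cycles needed to move a data volume over one data path."""
    return volume_bytes / bandwidth_bytes_per_cycle


# Hypothetical data volumes per unit of work (bytes); illustration only.
volumes = {"L1-L2": 128, "L2-memory": 64}

# Hypothetical data path bandwidths (bytes per cycle); illustration only.
bandwidths = {"L1-L2": 64, "L2-memory": 16}

# One time contribution per data path in the memory hierarchy.
contributions = {
    path: transfer_cycles(volumes[path], bandwidths[path]) for path in volumes
}

for path, cycles in contributions.items():
    print(f"{path}: {cycles} cy per unit of work")
```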