Kerncraft: A Tool for Analytic Performance Modeling of Loop Kernels

Hammer, Julian; Eitzinger, Jan; Hager, Georg; Wellein, Gerhard

doi:10.1007/978-3-319-56702-0_1

Cited by 25 publications

(28 citation statements)

References 13 publications


“…If a dataset fits in the first-level cache, all accesses will behave the same and there is no need to consider the order and pattern of previous accesses or (possibly undisclosed) cache replacement algorithms. Behavior beyond L1 can be modeled separately, but this is beyond the scope of this work (the Kerncraft tool [4], which relies on an in-core analysis from IACA and - in the future - OSACA, combines it with data analysis for a unified Roofline or ECM prediction). 2) Multiple available ports per instruction are utilized with fixed probabilities.…”

Section: A Background (mentioning)

confidence: 99%


Automated Instruction Stream Throughput Prediction for Intel and AMD Microarchitectures

Laukemann

Hammer

Hofmann

et al. 2018

2018 IEEE/ACM Performance Modeling, Benchmarking and Simulation of High Performance Computer Systems (PMBS)

Self Cite


An accurate prediction of scheduling and execution of instruction streams is a necessary prerequisite for predicting the in-core performance behavior of throughput-bound loop kernels on out-of-order processor architectures. Such predictions are an indispensable component of analytical performance models, such as the Roofline and the Execution-Cache-Memory (ECM) model, and allow a deep understanding of the performance-relevant interactions between hardware architecture and loop code. We present the Open Source Architecture Code Analyzer (OSACA), a static analysis tool for predicting the execution time of sequential loops comprising x86 instructions under the assumption of an infinite first-level cache and perfect out-of-order scheduling. We show the process of building a machine model from available documentation and semi-automatic benchmarking, and carry it out for the latest Intel Skylake and AMD Zen micro-architectures. To validate the constructed models, we apply them to several assembly kernels and compare runtime predictions with actual measurements. Finally, we give an outlook on how the method may be generalized to new architectures.



“…Once known, the bottleneck can often be mitigated by changes in the code, the runtime parameters, or the execution environment. When the models' construction is automated [3], [4], compilers and a wider user base can take advantage of them.…”

Section: Introduction (mentioning)

confidence: 99%

Automated Instruction Stream Throughput Prediction for Intel and AMD Microarchitectures

Laukemann

Hammer

Hofmann

et al. 2018

2018 IEEE/ACM Performance Modeling, Benchmarking and Simulation of High Performance Computer Systems (PMBS)

Self Cite


“…By including information available from performance models for the different algorithms, the workload estimator can be made more general and flexible. Tools like Kerncraft [41] automatically analyze the performance of a given implementation for the hardware at hand, which would render the estimator independent of these factors. Furthermore, a workload estimate based on the current runtimes is a natural alternative to the proposed predictor as it is able to use actual data from the currently running simulation.…”

Section: Results (mentioning)

confidence: 99%

Dynamic Load Balancing Techniques for Particulate Flow Simulations

Rettinger

Rüde

2019

Computation


Parallel multiphysics simulations often suffer from load imbalances originating from the applied coupling of algorithms with spatially and temporally varying workloads. It is thus desirable to minimize these imbalances to reduce the time to solution and to better utilize the available hardware resources. Taking particulate flows as an illustrating example application, we present and evaluate load balancing techniques that tackle this challenging task. This involves a load estimation step in which the currently generated workload is predicted. We describe in detail how such a workload estimator can be developed. In a second step, load distribution strategies like space-filling curves or graph partitioning are applied to dynamically distribute the load among the available processes. To compare and analyze their performance, we employ these techniques to a benchmark scenario and observe a reduction of the load imbalances by almost a factor of four. This results in a decrease of the overall runtime by 14% for space-filling curves.


“…However, they require a deep understanding of the underlying micro-architecture in order to yield accurate results. Common (simplified) approaches for numerical kernels are the Roofline [1] model or the ECM [2] model, whose construction is supported by the Kerncraft open-source performance modeling tool [3]. For Roofline, the Roofline Model Toolkit [4] and Intel's Roofline Advisor 1 are also available.…”

Section: Introduction (mentioning)

confidence: 99%

“…With OSACA's semi-automatic benchmarking pipeline, compilers can benefit from an automated model construction [3], [4]. The instruction database is dynamically extendable, which enables users to adapt the tool to other application scenarios beyond numerical kernels found in HPC use cases.…”

Section: Introduction (mentioning)

confidence: 99%

Automatic Throughput and Critical Path Analysis of x86 and ARM Assembly Kernels

Laukemann

Hammer

Hager

et al. 2019

2019 IEEE/ACM Performance Modeling, Benchmarking and Simulation of High Performance Computer Systems (PMBS)

Self Cite


Useful models of loop kernel runtimes on out-of-order architectures require an analysis of the in-core performance behavior of instructions and their dependencies. While an instruction throughput prediction sets a lower bound to the kernel runtime, the critical path defines an upper bound. Such predictions are an essential part of analytic (i.e., white-box) performance models like the Roofline and Execution-Cache-Memory (ECM) models. They enable a better understanding of the performance-relevant interactions between hardware architecture and loop code. The Open Source Architecture Code Analyzer (OSACA) is a static analysis tool for predicting the execution time of sequential loops. It previously supported only x86 (Intel and AMD) architectures and simple, optimistic full-throughput execution. We have heavily extended OSACA to support ARM instructions and critical path prediction including the detection of loop-carried dependencies, which turns it into a versatile cross-architecture modeling tool. We show runtime predictions for code on Intel Cascade Lake, AMD Zen, and Marvell ThunderX2 micro-architectures based on machine models from available documentation and semi-automatic benchmarking. The predictions are compared with actual measurements.
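
The two bounds named in this abstract can be illustrated with a small sketch. It is not OSACA's implementation: the toy kernel, the port names, the latencies, and the rule that an instruction spreads evenly over its eligible ports are all assumptions made here for illustration. The throughput bound is taken as the highest per-port pressure, and the critical path as the longest latency chain through the dependency graph of one loop iteration.

```python
from collections import defaultdict

# Hypothetical toy kernel (not taken from any paper or machine model):
# name -> (eligible ports, latency in cycles, names of instructions it depends on)
KERNEL = {
    "load_a": (("P2", "P3"), 4, ()),
    "load_b": (("P2", "P3"), 4, ()),
    "mul":    (("P0", "P1"), 4, ("load_a", "load_b")),
    "add":    (("P0", "P1"), 4, ("mul",)),
    "store":  (("P4",),      1, ("add",)),
}


def throughput_bound(kernel):
    """Cycles per iteration if only port occupancy mattered (lower bound).

    Assumption: each instruction is split evenly across its eligible ports.
    """
    pressure = defaultdict(float)
    for ports, _latency, _deps in kernel.values():
        for port in ports:
            pressure[port] += 1.0 / len(ports)
    return max(pressure.values())


def critical_path(kernel):
    """Longest latency chain through the dependency graph (upper bound)."""
    finish_time = {}

    def finish(name):
        # Memoized depth-first walk: an instruction finishes one latency
        # after the last of its dependencies has finished.
        if name not in finish_time:
            _ports, latency, deps = kernel[name]
            finish_time[name] = latency + max((finish(d) for d in deps), default=0)
        return finish_time[name]

    return max(finish(name) for name in kernel)


tp = throughput_bound(KERNEL)
cp = critical_path(KERNEL)
print(f"throughput bound: {tp} cy/it, critical path: {cp} cy/it")
```

For this toy kernel every port carries one cycle of work per iteration, while the dependency chain load, multiply, add, store adds up to 13 cycles, so a measured runtime would be expected to fall between the two values.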
