Performance Analysis of the Kahan-Enhanced Scalar Product on Current Multicore Processors

Hofmann, Jan; Fey, Dietmar; Riedmann, M.; Eitzinger, Jan; Hager, Georg; Wellein, Gerhard

doi:10.1007/978-3-319-32149-3_7

Cited by 9 publications

(12 citation statements)

References 17 publications


“…Here we evaluate the correctness of predictions derived by Kerncraft from kernel codes and hardware descriptions. It is out of the scope of this work to evaluate the underlying performance models; this has been discussed elsewhere [4,23,8,18,11,10]. We will, however, compare predictions by Kerncraft to predictions derived by manual analysis in previously published papers (see Table 5) and point out relevant differences and peculiarities.…”

Section: Discussion (mentioning)

confidence: 99%

“…Here we will have a look at the Kahan-compensated double-precision dot product and the Schönauer Triad. These have been analyzed thoroughly in [11] and [8], respectively.…”

Section: Streaming Kernels (mentioning)

confidence: 99%

“…It provides a scalar product of two arrays a[] and b[], correcting for round-off errors due to the finite-precision floating-point number representation [14]. As was shown in [11], current compilers fail in generating efficient (or correct) machine code from the C source. In our case the compiler could not use SIMD vectorization due to the presence of a loop-carried dependency, but it produced correct scalar code without further unrolling.…”

Section: Kahan-ddot (mentioning)

confidence: 99%
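The compensated scalar product described in this statement can be sketched in C as follows. This is a minimal illustration, not the code from the cited papers: the function names `ddot_naive` and `ddot_kahan` are hypothetical, and the update order is the textbook Kahan summation applied to the products a[i]*b[i]. The variable `c` is carried from one iteration to the next, which is the loop-carried dependency the statement refers to.

```c
#include <stddef.h>

/* Naive scalar product: a single accumulator, rounding errors pile up. */
double ddot_naive(const double *a, const double *b, size_t n)
{
    double s = 0.0;
    for (size_t i = 0; i < n; ++i)
        s += a[i] * b[i];
    return s;
}

/* Kahan-compensated scalar product (textbook form, names are hypothetical). */
double ddot_kahan(const double *a, const double *b, size_t n)
{
    double s = 0.0;  /* running sum */
    double c = 0.0;  /* running compensation: low-order bits lost so far */
    for (size_t i = 0; i < n; ++i) {
        double y = a[i] * b[i] - c;  /* apply the correction to the new term */
        double t = s + y;            /* low-order bits of y may be lost here */
        c = (t - s) - y;             /* recover what was lost */
        s = t;
    }
    return s;
}
```

Value-unsafe compiler options (for example `-ffast-math` in GCC and Clang) permit algebraic simplifications that can cancel the compensation term, so a sketch like this is normally built without them.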

“…The result for T nOL on Sandy Bridge in Table 5 differs from the reference result, since the latter was produced with a scalar but otherwise optimal version of the code that used modulo unrolling to hide the inter-iteration stalls. Note also that [11] uses latency penalties to make the ECM model work better in memory. The Kerncraft tool has this capability (in fact, the penalty cycles are part of the machine files), but it is deactivated by default.…”

Section: Kahan-ddot (mentioning)

confidence: 99%
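Modulo unrolling, as mentioned in this statement, can be illustrated with a scalar C sketch that keeps several independent sum/compensation pairs so that consecutive iterations do not wait on each other. Everything here is an assumption for illustration: the function name `ddot_kahan_unrolled`, the unroll factor of 4, and the way the partial results are merged at the end are not taken from the cited papers. A suitable unroll factor depends on the latency and throughput of the floating-point units of the target core.

```c
#include <stddef.h>

#define UNROLL 4  /* hypothetical unroll factor, chosen only for illustration */

double ddot_kahan_unrolled(const double *a, const double *b, size_t n)
{
    double s[UNROLL] = {0.0};  /* independent partial sums */
    double c[UNROLL] = {0.0};  /* one compensation term per partial sum */
    size_t i = 0;

    /* Main loop: lane k only depends on its own s[k] and c[k]. */
    for (; i + UNROLL <= n; i += UNROLL) {
        for (int k = 0; k < UNROLL; ++k) {
            double y = a[i + k] * b[i + k] - c[k];
            double t = s[k] + y;
            c[k] = (t - s[k]) - y;
            s[k] = t;
        }
    }

    /* Remainder iterations go into lane 0. */
    for (; i < n; ++i) {
        double y = a[i] * b[i] - c[0];
        double t = s[0] + y;
        c[0] = (t - s[0]) - y;
        s[0] = t;
    }

    /* Merge: seed with lane 0 and the pooled compensation terms, then
       add the remaining lanes with the same compensated update. */
    double sum = s[0];
    double comp = 0.0;
    for (int k = 0; k < UNROLL; ++k)
        comp += c[k];
    for (int k = 1; k < UNROLL; ++k) {
        double y = s[k] - comp;
        double t = sum + y;
        comp = (t - sum) - y;
        sum = t;
    }
    return sum;
}
```

Because the lanes sum the terms in a different order than a single-accumulator loop, the result may differ from it in the last bits.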


Automatic loop kernel analysis and performance modeling with Kerncraft

Hammer

Hager

Eitzinger

et al. 2015

Proceedings of the 6th International Workshop on Performance Modeling, Benchmarking, and Simulation of High Performance Computi

Self Cite


Analytic performance models are essential for understanding the performance characteristics of loop kernels, which consume a major part of CPU cycles in computational science. Starting from a validated performance model one can infer the relevant hardware bottlenecks and promising optimization opportunities. Unfortunately, analytic performance modeling is often tedious even for experienced developers since it requires in-depth knowledge about the hardware and how it interacts with the software. We present the "Kerncraft" tool, which eases the construction of analytic performance models for streaming kernels and stencil loop nests. Starting from the loop source code, the problem size, and a description of the underlying hardware, Kerncraft can ideally predict the single-core performance and scaling behavior of loops on multicore processors using the Roofline or the Execution-Cache-Memory (ECM) model. We describe the operating principles of Kerncraft with its capabilities and limitations, and we show how it may be used to quickly gain insights by accelerated analytic modeling.

Supporting tools are employed to determine parameters that are required as model input in the machine description. We use the LIKWID tool suite [19] for most of these tasks: The machine topology, i.e., information about core and cache sharing, ccNUMA structure, cache sizes, etc., is extracted from the output of likwid-topology. Achievable bandwidths to caches and main memory are measured with the likwid-bench tool [20], since it provides a controlled and compiler-independent environment for building tailored benchmark loops. Any analytic performance model must be checked for validity by comparing its predictions with measurements on the target hardware. The validation of predictions with measurements is an integral part of the Kerncraft tool.

This paper is organized as follows: In Sect. 2 we briefly describe the components of the performance models (in-core model, Roofline, and ECM) supported by Kerncraft. Section 3 introduces the hardware and software used for all experiments. Details about the structure of the Kerncraft tool and its concrete implementation are given in Sect. 4. In Sect. 5 we evaluate the tool using streaming and stencil loop codes, and Sect. 7 gives a summary and an outlook to future work.

The current version of Kerncraft is available for download at https://github.com/RRZE-HPC/kerncraft.

arXiv:1509.03778v2 [cs.PF] 5 Nov 2015

Listing 1: Scalar product in double precision

    double a[], b[], s = 0.;
    for (i = 0; i < N; ++i)
        s += a[i] * b[i];
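As a small worked example of the Roofline model named in the abstract, the sketch below evaluates the standard bound P = min(P_peak, I * b_S) for the double-precision scalar product of Listing 1. Each iteration performs 2 flops (one multiply, one add) and loads 2 * 8 bytes, so the intensity I is 1/8 flop per byte. The function names and any machine numbers passed in are hypothetical; this is not how Kerncraft itself is implemented.

```c
/* Roofline bound: attainable performance is limited either by the peak
   floating-point rate or by data traffic (intensity times bandwidth). */
double roofline_gflops(double peak_gflops,
                       double intensity_flop_per_byte,
                       double bandwidth_gbyte_per_s)
{
    double traffic_limit = intensity_flop_per_byte * bandwidth_gbyte_per_s;
    return traffic_limit < peak_gflops ? traffic_limit : peak_gflops;
}

/* Computational intensity of s += a[i] * b[i] in double precision:
   2 flops per iteration, 2 loads of 8 bytes each. */
double ddot_intensity(void)
{
    return 2.0 / 16.0;
}
```

With an assumed memory bandwidth of 40 GB/s and an assumed peak of 100 GFlop/s, `roofline_gflops(100.0, ddot_intensity(), 40.0)` yields 5 GFlop/s, which shows why this kernel is bandwidth-bound for data sets in main memory.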


“…The ECM model [19,3,18,4,5] is an analytic performance model that, with the exception of sustained memory bandwidth, works exclusively with architecture specifications as inputs. The model estimates the numbers of CPU cycles required to execute a number of iterations of a loop on a single core of a multi- or manycore chip.…”

Section: The ECM Performance Model (mentioning)

confidence: 99%
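The cycle estimate described in this statement can be sketched as below. The combination rule used here, T = max(T_OL, T_nOL + sum of inter-level transfer times), is the form commonly stated for Intel processors in the ECM literature; it is not spelled out in the excerpt above, so treat it as an assumption. The type and function names (`ecm_input`, `ecm_level`, `ecm_cycles`) are hypothetical, and all times are cycles per unit of work (typically one cache line).

```c
/* Inputs of the sketch, all in cycles per unit of work. */
typedef struct {
    double t_ol;     /* in-core time that overlaps with data transfers */
    double t_nol;    /* in-core time that does not overlap (e.g. L1 loads) */
    double t_l1l2;   /* transfer time between L1 and L2 */
    double t_l2l3;   /* transfer time between L2 and L3 */
    double t_l3mem;  /* transfer time between L3 and main memory */
} ecm_input;

/* Memory level in which the working set resides. */
typedef enum { ECM_L1, ECM_L2, ECM_L3, ECM_MEM } ecm_level;

/* Predicted cycles, assuming transfers do not overlap with each other
   or with t_nol, while t_ol overlaps with everything else. */
double ecm_cycles(const ecm_input *in, ecm_level level)
{
    double t_data = in->t_nol;
    if (level >= ECM_L2)  t_data += in->t_l1l2;
    if (level >= ECM_L3)  t_data += in->t_l2l3;
    if (level >= ECM_MEM) t_data += in->t_l3mem;
    return t_data > in->t_ol ? t_data : in->t_ol;
}
```

For made-up inputs of 4, 2, 2, 4 and 8 cycles (in the order of the struct fields), the sketch gives 4, 4, 8 and 16 cycles for data in L1, L2, L3 and memory, respectively.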

An ECM-based Energy-Efficiency Optimization Approach for Bandwidth-Limited Streaming Kernels on Recent Intel Xeon Processors

Hofmann

Fey

2016

2016 4th International Workshop on Energy Efficient Supercomputing (E2SC)

Self Cite


We investigate an approach that uses low-level analysis and the execution-cache-memory (ECM) performance model in combination with tuning of hardware parameters to lower energy requirements of memory-bound applications. The ECM model is extended appropriately to deal with software optimizations such as nontemporal stores. Using incremental steps and the ECM model, we analytically quantify the impact of various single-core optimizations and pinpoint microarchitectural improvements that are relevant to energy consumption. Using a 2D Jacobi solver as example that can serve as a blueprint for other memory-bound applications, we evaluate our approach on the four most recent Intel Xeon E5 processors (Sandy Bridge-EP, Ivy Bridge-EP, Haswell-EP, and Broadwell-EP). We find that chip energy consumption can be reduced in the range of 2.0-2.4× on the examined processors.


Performance analysis of the Kahan‐enhanced scalar product on current multi‐core and many‐core processors

Hofmann

Fey

Riedmann

et al. 2016

Concurrency and Computation

Self Cite


We investigate the performance characteristics of a numerically enhanced scalar product (dot) kernel loop that uses the Kahan algorithm to compensate for numerical errors, and describe efficient single instruction multiple data-vectorized implementations on recent multi-core and many-core processors. Using low-level instruction analysis and the execution-cache-memory performance model, we pinpoint the relevant performance bottlenecks for single-core and thread-parallel execution and predict performance and saturation behavior. We show that the Kahan-enhanced scalar product comes at almost no additional cost compared with the naive (non-Kahan) scalar product if appropriate low-level optimizations, notably single instruction multiple data vectorization and unrolling, are applied. The execution-cache-memory model is extended appropriately to accommodate not only modern Intel multicore chips but also the Intel Xeon Phi 'Knights Corner' coprocessor and an IBM POWER8 CPU. This allows us to discuss the impact of processor features on the performance across four modern architectures that are relevant for high performance computing.

Figure 7. Single-core cycles per CL versus data set size on PWR8: (a) Results for different SMT settings for naive scalar product using SP; (b) Comparison of compiler-generated naive scalar product and manual SIMD Kahan-enhanced scalar product using SMT-8. The horizontal lines are ECM model predictions.
