Understanding HPC Benchmark Performance on Intel Broadwell and Cascade Lake Processors

Alappat, Christie L.; Hofmann, Jan; Hager, Georg; Fehske, Holger; Bishop, A. R.; Wellein, Gerhard

doi:10.1007/978-3-030-50743-5_21

Cited by 16 publications

(14 citation statements)

References 20 publications


“…Therefore, emulating simulations on more than 128 MPI processes on DEEP-EST CM using the NEST dry-run mode [40] may provide further insights. A comprehensive comparison between the two generations of processors based on microbenchmarks as presented in [46] for the related microarchitectures Intel Broadwell and Intel Cascade Lake is out of the scope of this study. It is our hope though that our efforts to present the changes to code in an abstract fashion make this study relevant for a broader computer science community and might even inspire the definition of future microbenchmarks.…”

Section: Discussion (mentioning)

confidence: 99%

Routing brain traffic through the von Neumann bottleneck: Efficient cache usage in spiking neural network simulation code on general purpose computers

Pronold, Jordan, Wylie et al. 2021

Preprint


Simulation is a third pillar next to experiment and theory in the study of complex dynamic systems such as biological neural networks. Contemporary brain-scale networks correspond to directed graphs of a few million nodes, each with an in-degree and out-degree of several thousands of edges, where nodes and edges correspond to the fundamental biological units, neurons and synapses, respectively. When considering a random graph, each node's edges are distributed across thousands of parallel processes. The activity in neuronal networks is also sparse. Each neuron occasionally transmits a brief signal, called spike, via its outgoing synapses to the corresponding target neurons. This spatial and temporal sparsity represents an inherent bottleneck for simulations on conventional computers: Fundamentally irregular memory-access patterns cause poor cache utilization. Using an established neuronal network simulation code as a reference implementation, we investigate how common techniques to recover cache performance such as software-induced prefetching and software pipelining can benefit a real-world application. The algorithmic changes reduce simulation time by up to 50%. The study exemplifies that many-core systems assigned with an intrinsically parallel computational problem can overcome the von Neumann bottleneck of conventional computer architectures.


“…The transparent huge pages (THP) setting was put to "always," and the NUMA balancing feature was turned off in order to reduce the performance impact from these settings [4]. All prefetching mechanisms in the hardware were enabled.…”

Section: Methods (mentioning)

confidence: 99%
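
The settings named in the excerpt above can be inspected on a Linux node. The following is a minimal sketch, not the cited authors' tooling: it assumes the standard Linux paths `/sys/kernel/mm/transparent_hugepage/enabled` and `/proc/sys/kernel/numa_balancing`, and the helper names `parse_thp_mode`, `read_setting`, and `report` are hypothetical.

```python
from pathlib import Path

# Standard Linux locations (assumed to exist only on Linux systems).
THP_PATH = Path("/sys/kernel/mm/transparent_hugepage/enabled")
NUMA_BALANCING_PATH = Path("/proc/sys/kernel/numa_balancing")


def parse_thp_mode(text):
    """Return the active THP mode, which the kernel marks with square brackets.

    Example file content: "always [madvise] never".
    """
    for token in text.split():
        if token.startswith("[") and token.endswith("]"):
            return token[1:-1]
    return None


def read_setting(path):
    """Return the stripped file content, or None if the file is unreadable."""
    try:
        return path.read_text().strip()
    except OSError:
        return None


def report():
    """Summarize both settings; values are None where the file is missing."""
    thp = read_setting(THP_PATH)
    numa = read_setting(NUMA_BALANCING_PATH)
    return {
        "thp_mode": parse_thp_mode(thp) if thp is not None else None,
        "numa_balancing_off": (numa == "0") if numa is not None else None,
    }


if __name__ == "__main__":
    print(report())
```

A configuration matching the excerpt would report `thp_mode` as `always` and `numa_balancing_off` as `True`.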

“…In summary, the desynchronization of the SymGS kernel, in itself a negligible effect, leads to some cores executing the DDOT2 kernel faster due to overlapping with idleness in the MPI_Allreduce function. This is the real reason for the "faster-than-light" performance observed in [4].…”

Section: A Motivation and Related Work (mentioning)

confidence: 99%

“…An analytic model of overlapping computational kernels and communication time on a contention domain was lacking, however. Alappat et al [4] observed the consequences of desynchronization in the context of the well-known HPCG benchmark: While validating their Roofline model for the MPI-parallel HPCG, they observed that the DDOT2 (dot product s+=a[i] * b[i]) kernels were in fact faster than what the local memory bandwidth would allow for. The authors attributed this to process desynchronization during the preceding sparse matrix-vector multiplication (SpMV) kernel and assumed that MPI processes that started the DDOT2 kernel early could benefit from immediate cache reuse, which was backed by a measured computational intensity that was higher than expected.…”

Section: A Motivation and Related Work (mentioning)

confidence: 99%
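
The Roofline argument in the excerpt above can be made concrete with a little arithmetic. The sketch below is an illustration under stated assumptions, not the cited authors' model: the function names are hypothetical, and the bandwidth (100 GB/s) and peak (1000 GFlop/s) figures are made-up round numbers, not measurements from either processor. It shows how the computational intensity of `s += a[i] * b[i]` rises when one of the two vectors is served from cache, which lifts the bandwidth-limited bound.

```python
def ddot_intensity(streams_from_memory=2, bytes_per_element=8):
    """Computational intensity (flop/byte) of s += a[i] * b[i].

    Each iteration does 2 flops (one multiply, one add). Memory traffic
    counts only the operand streams that actually come from main memory.
    """
    flops_per_iteration = 2.0
    bytes_per_iteration = streams_from_memory * bytes_per_element
    return flops_per_iteration / bytes_per_iteration


def roofline_bound(intensity, bandwidth_gbyte_s, peak_gflop_s):
    """Roofline limit: the lower of peak compute and intensity * bandwidth."""
    return min(peak_gflop_s, intensity * bandwidth_gbyte_s)


if __name__ == "__main__":
    bandwidth = 100.0  # GB/s, hypothetical value for illustration
    peak = 1000.0      # GFlop/s, hypothetical value for illustration

    both_from_memory = ddot_intensity(streams_from_memory=2)
    one_from_cache = ddot_intensity(streams_from_memory=1)

    print(both_from_memory, roofline_bound(both_from_memory, bandwidth, peak))
    print(one_from_cache, roofline_bound(one_from_cache, bandwidth, peak))
```

With double-precision operands, both streams from memory give 0.125 flop/byte; if one stream hits the cache, the intensity doubles to 0.25 flop/byte and so does the bandwidth-limited bound, consistent with a measured intensity that is "higher than expected".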


An analytic performance model for overlapping execution of memory-bound loop kernels on multicore CPUs

Afzal, Hager, Wellein 2020

Preprint

Self Cite


Complex applications running on multicore processors show a rich performance phenomenology. The growing number of cores per ccNUMA domain complicates performance analysis of memory-bound code since system noise, load imbalance, or task-based programming models can lead to thread desynchronization. Hence, the simplifying assumption that all cores execute the same loop cannot be upheld. Motivated by observations on plain and modified versions of the HPCG benchmark, we construct a performance model of execution of memory-bound loop kernels. It can predict the memory bandwidth share per kernel on a memory contention domain depending on the number of active cores and which other workload the kernel is paired with. The only code features required are the single-thread cache line access frequency per kernel, which is directly related to the single-thread memory bandwidth, and its saturated bandwidth. The former can either be measured directly or predicted using the Execution-Cache-Memory (ECM) performance model. The computational intensity of the kernels and the detailed structure of the code is of no significance. We validate our model on Intel Broadwell, Intel Cascade Lake, and AMD Rome processors pairing various streaming and stencil kernels. The error in predicting the bandwidth share per kernel is less than 8%.


“…An analytic model of overlapping computational kernels and communication time on a contention domain was lacking, however. Alappat et al. [12] observed the consequences of desynchronization in the context of the well-known HPCG benchmark: While validating their Roofline model for the MPI-parallel HPCG, they observed that the DDOT2 (dot product s+=a[i]*b[i]) kernels were in fact faster than what the local memory bandwidth would allow for. The authors attributed this to process desynchronization during the preceding sparse matrix-vector multiplication (SpMV) kernel and assumed that MPI processes that started the DDOT2 kernel early could benefit from immediate cache reuse, which was backed by a measured computational intensity that was higher than expected.…”

Section: Introduction (mentioning)

confidence: 99%

Analytic performance model for parallel overlapping memory‐bound kernels

Afzal, Hager, Wellein 2022

Concurrency and Computation

Self Cite


Complex applications running on multicore processors show a rich performance phenomenology. The growing number of cores per ccNUMA domain complicates performance analysis of memory-bound code since system noise, load imbalance, or task-based programming models can lead to thread desynchronization. Hence, the simplifying assumption that all cores execute the same loop cannot be upheld. Motivated by observations on plain and modified versions of the HPCG benchmark, we construct a performance model of execution of memory-bound loop kernels. It can predict the memory bandwidth share per kernel on a memory contention domain depending on the number of active cores and which other workload the kernel is paired with. The only code features required are the single-thread memory request fraction per kernel, which is directly related to the single-thread memory bandwidth, and its saturated bandwidth. The former can either be measured directly or predicted using the Execution-Cache-Memory performance model. The computational intensity of the kernels and the detailed structure of the code is of no significance. We validate our model on Intel Broadwell, Intel Cascade Lake, and AMD Rome processors pairing various streaming and stencil kernels. The error in predicting the bandwidth share per kernel is less than 8%.
