2020
DOI: 10.1007/978-3-030-50743-5_21

Understanding HPC Benchmark Performance on Intel Broadwell and Cascade Lake Processors

Abstract: Hardware platforms in high performance computing are constantly getting more complex to handle even when considering multicore CPUs alone. Numerous features and configuration options in the hardware and the software environment that are relevant for performance are not even known to most application users or developers. Microbenchmarks, i.e., simple codes that fathom a particular aspect of the hardware, can help to shed light on such issues, but only if they are well understood and if the results can be reconc…

Cited by 16 publications
(14 citation statements)
References 20 publications
“…Therefore, emulating simulations on more than 128 MPI processes on DEEP-EST CM using the NEST dry-run mode [40] may provide further insights. A comprehensive comparison between the two generations of processors based on microbenchmarks as presented in [46] for the related microarchitectures Intel Broadwell and Intel Cascade Lake is out of the scope of this study. It is our hope though that our efforts to present the changes to code in an abstract fashion make this study relevant for a broader computer science community and might even inspire the definition of future microbenchmarks.…”
Section: Discussion (citation type: mentioning)
confidence: 99%
“…The transparent huge pages (THP) setting was put to "always," and the NUMA balancing feature was turned off in order to reduce the performance impact from these settings [4]. All prefetching mechanisms in the hardware were enabled.…”
Section: Methods (citation type: mentioning)
confidence: 99%
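The two operating-system settings named in the statement above can be inspected on a Linux host. The following is a minimal sketch, not the cited authors' procedure: it assumes the standard Linux locations `/sys/kernel/mm/transparent_hugepage/enabled` (where the active mode is shown in square brackets) and `/proc/sys/kernel/numa_balancing` (0 means off). The helper names `active_thp_mode`, `read_or_none`, and `report` are hypothetical.

```python
from pathlib import Path

# Standard Linux locations; they may be absent on other systems or kernels.
THP_PATH = Path("/sys/kernel/mm/transparent_hugepage/enabled")
NUMA_PATH = Path("/proc/sys/kernel/numa_balancing")


def active_thp_mode(text: str) -> str:
    """Return the bracketed (active) mode from a line like '[always] madvise never'."""
    for token in text.split():
        if token.startswith("[") and token.endswith("]"):
            return token[1:-1]
    raise ValueError(f"no active mode marked in {text!r}")


def read_or_none(path: Path):
    """Read a small pseudo-file, returning None if it does not exist or is unreadable."""
    try:
        return path.read_text().strip()
    except OSError:
        return None


def report() -> dict:
    """Collect both settings; values are None where the files are unavailable."""
    thp_raw = read_or_none(THP_PATH)
    numa_raw = read_or_none(NUMA_PATH)
    return {
        "thp_mode": active_thp_mode(thp_raw) if thp_raw else None,
        "numa_balancing_enabled": (numa_raw != "0") if numa_raw is not None else None,
    }
```

Calling `report()` on a node configured as described in the statement would be expected to give `thp_mode` equal to `"always"` and `numa_balancing_enabled` equal to `False`.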
“…In summary, the desynchronization of the SymGS kernel, in itself a negligible effect, leads to some cores executing the DDOT2 kernel faster due to overlapping with idleness in the MPI_Allreduce function. This is the real reason for the "faster-than-light" performance observed in [4].…”
Section: A Motivation and Related Work (citation type: mentioning)
confidence: 99%
“…An analytic model of overlapping computational kernels and communication time on a contention domain was lacking, however. Alappat et al. [12] observed the consequences of desynchronization in the context of the well‐known HPCG benchmark: While validating their Roofline model for the MPI‐parallel HPCG, they observed that the DDOT2 (dot product s+=a[i]*b[i]) kernels were in fact faster than what the local memory bandwidth would allow for. The authors attributed this to process desynchronization during the preceding sparse matrix‐vector multiplication (SpMV) kernel and assumed that MPI processes that started the DDOT2 kernel early could benefit from immediate cache reuse, which was backed by a measured computational intensity that was higher than expected.…”
Section: Introduction (citation type: mentioning)
confidence: 99%
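The "faster than the memory bandwidth would allow" observation in the last statement refers to the memory-bound Roofline limit of the dot product. A minimal sketch of that limit follows, assuming double-precision operands that are both streamed from main memory (2 flops and 16 bytes per iteration). The function names and the 100 GB/s bandwidth figure are illustrative assumptions, not values from the paper.

```python
def ddot(a, b):
    """Plain dot product, s += a[i] * b[i], as in the DDOT2 kernel description."""
    s = 0.0
    for x, y in zip(a, b):
        s += x * y
    return s


def dot_intensity(bytes_per_element: int = 8) -> float:
    """Flops per byte: one multiply and one add per iteration, two operand loads."""
    return 2.0 / (2 * bytes_per_element)


def roofline_bound(bandwidth_gbs: float, intensity: float,
                   peak_gflops: float = float("inf")) -> float:
    """Roofline limit in Gflop/s: min(peak performance, bandwidth * intensity)."""
    return min(peak_gflops, bandwidth_gbs * intensity)


# Hypothetical memory bandwidth of 100 GB/s (placeholder, not a measurement).
bound = roofline_bound(100.0, dot_intensity())
print(f"intensity = {dot_intensity()} flop/byte, bound = {bound} Gflop/s")
```

If part of the operands is served from cache rather than main memory, fewer bytes cross the memory interface per flop, so the measured intensity rises above 0.125 flop/byte and the observed performance can exceed this bound. That is the cache-reuse explanation the statement attributes to the original authors.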