GPU-STREAM v2.0: Benchmarking the Achievable Memory Bandwidth of Many-Core Processors Across Diverse Parallel Programming Models

Deakin, Tom; Price, James; Martineau, Matt; McIntosh-Smith, Simon

doi:10.1007/978-3-319-46079-6_34

Cited by 53 publications

(43 citation statements)

References 5 publications

“…It is clear that the GPU implementation provides a speedup of around 4× over the CPU implementation. The STREAM benchmark [17] achieves a memory bandwidth of 32 GBytes/s on the Opteron CPUs, whilst GPU-STREAM [8] achieves 182 GBytes/s on the K20X GPUs, a 5.7× improvement in memory bandwidth of the GPU over the CPU. These benchmarks have no communication costs associated with them as they are simply run on a single node.…”

Section: Weak Scaling (mentioning)

confidence: 99%

Many-Core Acceleration of a Discrete Ordinates Transport Mini-App at Extreme Scale

Deakin

McIntosh-Smith

Gaudin

2016

Lecture Notes in Computer Science

Self Cite

Abstract. Time-dependent deterministic discrete ordinates transport codes are an important class of application which provide significant challenges for large, many-core systems. One such challenge is the large memory capacity needed by the solve step, which requires us to have a scalable solution in order to have enough node-level memory to store all the data. In our previous work, we demonstrated the first implementation which showed a significant performance benefit for single node solves using GPUs. In this paper we extend our work to large problems and demonstrate the scalability of our solution on two Petascale GPU-based supercomputers: Titan at Oak Ridge and Piz Daint at CSCS. Our results show that our improved node-level parallelism scheme scales just as well across large systems as previous approaches when using the tried and tested KBA domain decomposition technique. We validate our results against an improved performance model which predicts the runtime of the main 'sweep' routine when running on different hardware, including CPUs or GPUs.

“…The GPU implementation provides a speedup of up to 2× over the original implementation running on the CPU. The STREAM benchmark [17] achieves a memory bandwidth of 41 GBytes/s on the single socket Xeon compared to 182 GBytes/s for GPU-STREAM on the K20X [8].…”

Section: Piz Daint (mentioning)

confidence: 99%

Many-Core Acceleration of a Discrete Ordinates Transport Mini-App at Extreme Scale

Deakin

McIntosh-Smith

Gaudin

2016

Lecture Notes in Computer Science

Self Cite

“…Calore et al reported achieving only 165 GB/s of bandwidth on a processor with a peak bandwidth of 352 GB/s. In comparison, the contemporary K20X GPU has a theoretical peak of 250 GB/s and achieves 182 GB/s of bandwidth.…”

Section: Introduction (mentioning)

confidence: 99%

High‐performance SIMD implementation of the lattice‐Boltzmann method on the Xeon Phi processor

Robertsén

Mattila

Westerholm

2018

Concurrency and Computation

Summary: We present a high‐performance implementation of the lattice‐Boltzmann method (LBM) on the Knights Landing generation of Xeon Phi. The Knights Landing architecture includes 16 GB of high‐speed memory (MCDRAM) with a reported bandwidth of over 400 GB/s, and a subset of the AVX‐512 single instruction multiple data (SIMD) instruction set. We explain five critical implementation aspects for high performance on this architecture: (1) the choice of appropriate LBM algorithm, (2) suitable data layout, (3) vectorization of the computation, (4) data prefetching, and (5) running our LBM simulations exclusively from the MCDRAM. The effects of these implementation aspects on the computational performance are demonstrated with the lattice‐Boltzmann scheme involving the D3Q19 discrete velocity set and the TRT collision operator. In our benchmark simulations of fluid flow through porous media, using double‐precision floating‐point arithmetic, the observed performance exceeds 960 million fluid lattice site updates per second.

“…The most well-known memory benchmark in HPC is STREAM [12]. BabelStream [13] is a popular implementation of this benchmark with support for different programming languages and devices. However, it does not support FPGAs and only provides a small subset of the functionality of our benchmark suite.…”

Section: Related Work (mentioning)

confidence: 99%

The Memory Controller Wall: Benchmarking the Intel FPGA SDK for OpenCL Memory Interface

Zohouri

Matsuoka

2019

2019 IEEE/ACM International Workshop on Heterogeneous High-Performance Reconfigurable Computing (H2RC)

Supported by their high power efficiency and recent advancements in High Level Synthesis (HLS), FPGAs are quickly finding their way into HPC and cloud systems. Large amounts of work have been done so far on loop and area optimizations for different applications on FPGAs using HLS. However, a comprehensive analysis of the behavior and efficiency of the memory controller of FPGAs is missing in literature, which becomes even more crucial when the limited memory bandwidth of modern FPGAs compared to their GPU counterparts is taken into account. In this work, we will analyze the memory interface generated by Intel FPGA SDK for OpenCL with different configurations for input/output arrays, vector size, interleaving, kernel programming model, on-chip channels, operating frequency, padding, and multiple types of overlapped blocking. Our results point to multiple shortcomings in the memory controller of Intel FPGAs, especially with respect to memory access alignment, that can hinder the programmer's ability in maximizing memory performance in their design. For some of these cases, we will provide work-arounds to improve memory bandwidth efficiency; however, a general solution will require major changes in the memory controller itself.
