2016
DOI: 10.1007/978-3-319-46079-6_34

GPU-STREAM v2.0: Benchmarking the Achievable Memory Bandwidth of Many-Core Processors Across Diverse Parallel Programming Models

Abstract: Many scientific codes consist of memory bandwidth bound kernels: the dominating factor of the runtime is the speed at which data can be loaded from memory into the Arithmetic Logic Units, before results are written back to memory. One major advantage of many-core devices such as General Purpose Graphics Processing Units (GPGPUs) and the Intel Xeon Phi is their focus on providing increased memory bandwidth over traditional CPU architectures. However, as with CPUs, this peak memory bandwidth is usually…
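
As a rough illustration of what "achievable memory bandwidth" means for a STREAM-style kernel, the sketch below turns an array length and a measured kernel time into a bandwidth figure. It assumes the usual STREAM counting convention (a triad `a[i] = b[i] + s * c[i]` touches three arrays, a copy touches two) and 8-byte doubles. The function name `stream_bandwidth_gbs` and the example timing are hypothetical and are not taken from the paper.

```python
def stream_bandwidth_gbs(n_elements, seconds, arrays_touched=3, bytes_per_element=8):
    """Return sustained bandwidth in GB/s (10^9 bytes per second).

    arrays_touched follows the STREAM convention: 3 for triad/add
    (two reads, one write), 2 for copy/scale (one read, one write).
    """
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    bytes_moved = arrays_touched * n_elements * bytes_per_element
    return bytes_moved / seconds / 1e9


if __name__ == "__main__":
    # Hypothetical measurement: a triad over 2**25 doubles that took 5 ms.
    print(f"{stream_bandwidth_gbs(2**25, 5e-3):.1f} GB/s")
```

The timing itself would come from the benchmark run on the device; only the bytes-over-time bookkeeping is shown here.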

Cited by 53 publications (43 citation statements)
References: 5 publications
“…It is clear that the GPU implementation provides a speedup of around 4× over the CPU implementation. The STREAM benchmark [17] achieves a memory bandwidth of 32 GBytes/s on the Opteron CPUs, whilst GPU-STREAM [8] achieves 182 GBytes/s on the K20X GPUs, a 5.7× improvement in memory bandwidth of the GPU over the CPU. These benchmarks have no communication costs associated with them as they are simply run on a single node.…”
Section: Weak Scaling (citation type: mentioning)
confidence: 99%
“…The GPU implementation provides a speedup of up to 2× over the original implementation running on the CPU. The STREAM benchmark [17] achieves a memory bandwidth of 41 GBytes/s on the single socket Xeon compared to 182 GBytes/s for GPU-STREAM on the K20X [8].…”
Section: Piz Daint (citation type: mentioning)
confidence: 99%
“…Calore et al reported achieving only 165 GB/s of bandwidth on a processor with a peak bandwidth of 352 GB/s. In comparison, the contemporary K20X GPU has a theoretical peak of 250 GB/s and achieves 182 GB/s of bandwidth.…”
Section: Introduction (citation type: mentioning)
confidence: 99%
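
The bandwidth figures quoted in the statements above can be put on a common footing by dividing achieved bandwidth by theoretical peak. The sketch below does that arithmetic using only the numbers that appear in the quotes. The helper name `fraction_of_peak` and the dictionary labels are illustrative, and the processor in the Calore et al. figure is left unnamed because the quote does not identify it.

```python
def fraction_of_peak(achieved_gbs, peak_gbs):
    """Return achieved bandwidth as a fraction of theoretical peak."""
    if peak_gbs <= 0:
        raise ValueError("peak_gbs must be positive")
    return achieved_gbs / peak_gbs


# (achieved GB/s, theoretical peak GB/s), as quoted in the citation statements.
figures = {
    "K20X GPU": (182.0, 250.0),
    "Processor reported by Calore et al.": (165.0, 352.0),
}

if __name__ == "__main__":
    for name, (achieved, peak) in figures.items():
        print(f"{name}: {fraction_of_peak(achieved, peak):.1%} of peak")

    # GPU-STREAM on the K20X versus STREAM on the Opteron CPUs (182 vs 32 GB/s).
    print(f"GPU over CPU: {182.0 / 32.0:.1f}x")
```

On these numbers the K20X sustains a noticeably larger share of its peak than the other processor, and the GPU-to-CPU ratio matches the 5.7× quoted in the Weak Scaling statement.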
“…The most well-known memory benchmark in HPC is STREAM [12]. BabelStream [13] is a popular implementation of this benchmark with support for different programming languages and devices. However, it does not support FPGAs and only provides a small subset of the functionality of our benchmark suite.…”
Section: Related Work (citation type: mentioning)
confidence: 99%