An Evaluation of Emerging Many-Core Parallel Programming Models

Martineau, Matt; McIntosh-Smith, Simon; Boulton, Mike; Gaudin, Wayne

doi:10.1145/2883404.2883420

Cited by 38 publications

(31 citation statements)

References 18 publications

“…To the authors' knowledge, the only study that has compared the same simple benchmark in all the programming models of interest across a wide range of devices is one they themselves performed, where the TeaLeaf heat diffusion miniapp from the Mantevo benchmark suite was used in a similar manner to measure performance portability [9,6].…”

Section: Related Work (mentioning)

confidence: 99%

GPU-STREAM v2.0: Benchmarking the Achievable Memory Bandwidth of Many-Core Processors Across Diverse Parallel Programming Models

Deakin

Price

Martineau

et al. 2016

Lecture Notes in Computer Science

Self Cite

Abstract. Many scientific codes consist of memory bandwidth bound kernels: the dominating factor of the runtime is the speed at which data can be loaded from memory into the Arithmetic Logic Units, before results are written back to memory. One major advantage of many-core devices such as General Purpose Graphics Processing Units (GPGPUs) and the Intel Xeon Phi is their focus on providing increased memory bandwidth over traditional CPU architectures. However, as with CPUs, this peak memory bandwidth is usually unachievable in practice and so benchmarks are required to measure a practical upper bound on expected performance. The choice of one programming model over another should ideally not limit the performance that can be achieved on a device. GPU-STREAM has been updated to incorporate a wide variety of the latest parallel programming models, all implementing the same parallel scheme. As such this tool can be used as a kind of Rosetta Stone which provides both a cross-platform and cross-programming model array of results of achievable memory bandwidth.

“…Lin et al [7] used the ROSE source-to-source compiler to port a number of stencil applications, investigating performance and productivity. In our previous work, we compared the performance of a number of parallel programming models, including OpenMP 4.0, Kokkos, and RAJA [8]. We later discussed the performance of OpenMP 4.0 ports of the TeaLeaf, CloverLeaf, and BUDE mini-apps on NVIDIA GPUs [9].…”

Section: Concluding Suggestions For Performance Portability (mentioning)

confidence: 99%

“…Faced with the plethora of parallel programming models currently available, we expect many developers will see OpenMP 4.x as a familiar and attractive option that can balance performance, portability, productivity and maintainability [8]. Of course, there are no guarantees of performance portability offered by the specification and the divergence of existing implementations means that it is currently possible to write code that is non-portable between different implementations even targeting the same architecture.…”

Section: Introduction (mentioning)

confidence: 99%

Pragmatic Performance Portability with OpenMP 4.x

Martineau

Price

McIntosh-Smith

et al. 2016

OpenMP: Memory, Devices, and Tasks

Self Cite

Abstract. In this paper we investigate the current compiler technologies supporting OpenMP 4.x features targeting a range of devices, in particular, the Cray compiler 8.5.0 targeting an Intel Xeon Broadwell and NVIDIA K20x, IBM's OpenMP 4.5 Clang branch (clang-ykt) targeting an NVIDIA K20x, the Intel compiler 16 targeting an Intel Xeon Phi Knights Landing, and GCC 6.1 targeting an AMD APU. We outline the mechanisms that they use to map the OpenMP model onto their target architectures, and conduct performance testing with a number of representative data parallel kernels. Following this we present a discussion about the current state of play in terms of performance portability and propose some straightforward guidelines for writing performance portable code, derived from our observations. At the time of writing, developers will likely have to rely on the pre-processor for certain kernels to achieve functional portability, but we expect that future homogenisation of required directives between compilers and architectures is feasible.

“…Martineau et al [5], [18], [19] discuss several variants of TeaLeaf that have been parallelised using a number of programming models. Further, they compare different solvers within TeaLeaf: Conjugate Gradient (CG), Chebyshev and Chebyshev polynomially preconditioned CG (PPCG), on three different Intel Xeon processors, an IBM Power8 processor, an NVIDIA Tesla K20x GPU and an Intel Knights Corner accelerator card [5], [18], [19]. Recently, TeaLeaf was reengineered to use the OPS [6] embedded domain specific language, and the Kokkos [7] and RAJA [8] C++ template libraries.…”

Section: Introduction (mentioning)

confidence: 99%

Achieving Performance Portability for a Heat Conduction Solver Mini-Application on Modern Multi-core Systems

Kirk

Mudalige

Reguly

et al. 2017

2017 IEEE International Conference on Cluster Computing (CLUSTER)

Abstract. Modernizing production-grade, often legacy applications to take advantage of modern multi-core and many-core architectures can be a difficult and costly undertaking. This is especially true currently, as it is unclear which architectures will dominate future systems. The complexity of these codes can mean that parallelisation for a given architecture requires significant re-engineering. One way to assess the benefit of such an exercise would be to use mini-applications that are representative of the legacy programs. In this paper, we investigate different implementations of TeaLeaf, a mini-application from the Mantevo suite that solves the linear heat conduction equation. TeaLeaf has been ported to use many parallel programming models, including OpenMP, CUDA and MPI among others. It has also been re-engineered to use the OPS embedded DSL and template libraries Kokkos and RAJA. We use these different implementations to assess the performance portability of each technique on modern multi-core systems. While manually parallelising the application targeting and optimizing for each platform gives the best performance, this has the obvious disadvantage that it requires the creation of different versions for each and every platform of interest. Frameworks such as OPS, Kokkos and RAJA can produce executables of the program automatically that achieve comparable portability. Based on a recently developed performance portability metric, our results show that OPS and RAJA achieve an application performance portability score of 71% and 77% respectively for this application.
