Performance portability study of linear algebra kernels in OpenCL

Rupp, Karl; Tillet, Philippe; Rudolf, Florian; Weinbub, Josef; Jüngel, Ansgar

doi:10.1145/2664666.2664674

Cited by 6 publications

(9 citation statements)

References 12 publications


“…Benchmark results in Section 6 demonstrate that ViennaCL provides performance comparable to or better than vendor-tuned libraries for sparse matrix-vector products and sparse matrix-matrix products. These results complement earlier work, which reported competitive performance of ViennaCL for dense linear algebra operations [44,54]. In addition, benchmark results for pipelined iterative solvers with kernel fusion and two important types of preconditioners allow for a comparison of solver performance on different hardware platforms.…”

supporting

confidence: 85%

“…The parameters in the device database are also useful for the CUDA compute backend, since it includes the best parameters found for the local and global workgroup sizes. Because architectural differences for GPUs from NVIDIA are smaller than across vendors, we found that only setting proper workgroup sizes at runtime is enough to obtain good performance for memory-bandwidth limited kernels on NVIDIA GPUs [44].…”

Section: Device Database

mentioning

confidence: 99%

“…However, there is no automatic performance portability: The optimization of a kernel for a particular target device does not imply good performance on a different device. Even though there is good correlation of performance across devices of the same type [44], target-specific characteristics of the device need to be taken into account for best performance. To generate such optimized kernels, a database holding the best kernel parameters for target devices is integrated into ViennaCL.…”

Section: Device Database

mentioning

confidence: 99%


ViennaCL---Linear Algebra Library for Multi- and Many-Core Architectures

Rupp, Tillet, Rudolf, et al. 2016. SIAM J. Sci. Comput.

Self Cite

106


CUDA, OpenCL, and OpenMP are popular programming models for the multi-core architectures of CPUs and many-core architectures of GPUs or Xeon Phis. At the same time, computational scientists face the question of which programming model to use to obtain their scientific results. We present the linear algebra library ViennaCL, which is built on top of all three programming models, thus enabling computational scientists to interface to a single library, yet obtain high performance for all three hardware types. Since the respective compute backend can be selected at runtime, one can seamlessly switch between different hardware types without the need for error-prone and time-consuming recompilation steps. We present new benchmark results for sparse linear algebra operations in ViennaCL, complementing results for the dense linear algebra operations in ViennaCL reported in earlier work. Comparisons with vendor-libraries show that ViennaCL provides better overall performance for sparse matrix-vector and sparse matrix-matrix products. Additional benchmark results for pipelined iterative solvers with kernel fusion and preconditioners identify the respective sweet spots for CPUs, Xeon Phis, and GPUs.



“…Rupp et al perform an extensive intervendor and intravendor performance portability investigation of OpenCL using miniature linear algebra kernels. Concentrating on structured grid codes, McIntosh‐Smith et al used 3 benchmarks, including the mini app CloverLeaf, to investigate the performance portability of OpenCL across a number of devices.…”

Section: Related Work

mentioning

confidence: 99%

“…CloverLeaf has also been used to investigate the performance of the OPS DSL. 26,27 Rupp et al 28 CPU with respect to microscopy image analysis. They concluded that the devices had a significant variance between particular operations, exposing some preference for particular operations.…”

Section: Related Work

mentioning

confidence: 99%

Assessing the performance portability of modern parallel programming models using TeaLeaf

Martineau, McIntosh–Smith, Gaudin. 2017. Concurrency and Computation


Summary: In this work, we evaluate several emerging parallel programming models: Kokkos, RAJA, OpenACC, and OpenMP 4.0, against the mature CUDA and OpenCL APIs. Each model has been used to port TeaLeaf, a miniature proxy application, or mini app, that solves the heat conduction equation and belongs to the Mantevo Project. We find that the best performance is achieved with architecture‐specific implementations but that, in many cases, the performance portable models are able to solve the same problems to within a 5% to 30% performance penalty. While the models expose varying levels of complexity to the developer, they all achieve reasonable performance with this application. As such, if this small performance penalty is permissible for a problem domain, we believe that productivity and development complexity can be considered the major differentiators when choosing a modern parallel programming model to develop applications like TeaLeaf.


An Evaluation of Emerging Many-Core Parallel Programming Models

Martineau, McIntosh–Smith, Boulton, et al. 2016. Proceedings of the 7th International Workshop on Programming Models and Applications for Multicores and Manycores


In this work we directly evaluate several emerging parallel programming models: Kokkos, RAJA, OpenACC, and OpenMP 4.0, against the mature CUDA and OpenCL APIs. Each model has been used to port TeaLeaf, a miniature proxy application, or miniapp, that solves the heat conduction equation, and belongs to the Mantevo suite of applications. We find that the best performance is achieved with device-tuned implementations but that, in many cases, the performance portable models are able to solve the same problems to within a 5-20% performance penalty. The models expose varying levels of complexity to the developer, and they all present reasonable performance. We believe that complexity will become the major influencer in the long-term adoption of such models.

