Experiences in autotuning matrix multiplication for energy minimization on GPUs

Anzt, Hartwig; Haugen, Blake; Kurzak, Jakub; Łuszczek, Piotr; Dongarra, Jack

doi:10.1002/cpe.3516

Cited by 19 publications

(10 citation statements)

References 29 publications

Supporting

Mentioning

Contrasting

Order By: Relevance

“…1 Consider the architecture of the number one system on the list, the Summit supercomputer at the Oak Ridge Leadership Computing Facility (OLCF). 2 Summit contains three NVIDIA V100 GPUs per each POWER9 CPU. The peak double-precision floating-point performance of the CPU is 22 (cores) × 24.56 GF LOPS = 540.32 GF LOPS.…”

Section: Motivationmentioning

confidence: 99%

“…While in the previous sections we used the simplistic gemm kernel from the CUDA Programming Guide, here we are using a highly parametrized gemm kernel developed in the course of our past autotuning efforts [2,9,10]. It is based on a fairly standard approach, where values of C are accumulated in registers, while values of A and B are streamed through shared memory in thin stripes.…”

Section: Kernelmentioning

confidence: 99%

“…A is streamed through shared memory in stripes of size blk_m × blk_k and B is streamed in stripes of size blk_k × blk_n. For the lack of space, we refer the readers to the literature for more details [2,9,10].…”

Section: Kernelmentioning

confidence: 99%

See 2 more Smart Citations

Massively Parallel Automated Software Tuning

Kurzak

Tsai

Gates

et al. 2019

Proceedings of the 48th International Conference on Parallel Processing

Self Cite

View full text Add to dashboard Cite

This article presents an implementation of a distributed autotuning engine developed as part of the Bench-testing OpenN Software Autotuning Infrastructure project. The system is geared towards performance optimization of computational kernels for graphics processing units, and allows for the deployment of vast autotuning sweeps to massively parallel machines. The software implements dynamic work scheduling to distributed-memory resources and takes advantage of multithreading for parallel compilation and dispatches kernel launches to multiple accelerators. This paper lays out the main design principles of the system and discusses the basic mechanics of the initial implementation. Preliminary performance results are presented, encountered challenges are discussed, and the future directions are outlined. CCS CONCEPTS • Software and its engineering → Massively parallel systems.

show abstract

Section: Motivationmentioning

confidence: 99%

Section: Kernelmentioning

confidence: 99%

See 1 more Smart Citation

Massively Parallel Automated Software Tuning

Kurzak

Tsai

Gates

et al. 2019

Proceedings of the 48th International Conference on Parallel Processing

Self Cite

View full text Add to dashboard Cite

show abstract

“…4. Based on performance and energy efficiency, a tradeoff between the two metrics [47] can be determined as follows:…”

Section: ]mentioning

confidence: 99%

Toward Exascale Computing Systems: An Energy Efficient Massive Parallel Computational Model

Ashraf¹,

Eassa²,

Albeshri³

et al. 2018

ijacsa

View full text Add to dashboard Cite

The emerging Exascale supercomputing system expected till 2020 will unravel many scientific mysteries. This extreme computing system will achieve a thousand-fold increase in computing power compared to the current petascale computing system. The forthcoming system will assist system designers and development communities in navigating from traditional homogeneous to the heterogeneous systems that will be incorporated into powerful accelerated GPU devices beside traditional CPUs. For achieving ExaFlops (10 18 calculations per second) performance through the ultrascale and energy-efficient system, the current technologies are facing several challenges. Massive parallelism is one of these challenges, which requires a novel energy-efficient parallel programming (PP) model for providing the massively parallel performance. In the current study, a new parallel programming model has been proposed, which is capable of achieving massively parallel performance through coarse-grained and fine-grained parallelism over internode and intra-node architectural-based processing. The suggested model is a tri-level hybrid of MPI, OpenMP and CUDA that is computable over a heterogeneous system with the collaboration of traditional CPUs and energy-efficient GPU devices. Furthermore, the developed model has been demonstrated by implementing dense matrix multiplication (DMM). The proposed model is considered an initial and leading model for obtaining massively parallel performance in an Exascale computing system.

show abstract

“…An important aspect of the proposed design is the reliance on the matrix-matrix multiplications (GEMM) kernel in almost all BLAS routines. This approach not only reduces the code base required for the development, but also achieves high performance for all routines, thanks to the GEMM kernel being a common target for continuing research, development, and tuning [21] [13] [3]. In fact, any performance improvement done to the GEMM kernel would automatically propagate to almost all other routines in Level-3 BLAS.…”

Section: Introductionmentioning

confidence: 99%

Novel HPC techniques to batch execution of many variable size BLAS computations on GPUs

Abdelfattah

Haidar

Tomov

et al. 2017

Proceedings of the International Conference on Supercomputing

Self Cite

View full text Add to dashboard Cite

This paper presents a software framework for solving large numbers of relatively small matrix problems using GPUs. Our approach combines novel and existing HPC techniques to methodically apply performance analysis, kernel design, low-level optimizations, and autotuning to exceed in performance proprietary vendor libraries. As a case study, we discuss the fundamental matrix operations defined by the Basic Linear Algebra Subprograms (BLAS) standard. This case study is significantly important for wide range of applications, including astrophysics, tensor contractions, sparse direct solvers, and others. We provide a generic design that is capable of dealing with problems of different sizes, and handling the irregularity arising from size variations. The developed solution adopts a batched computation scheme, where the same operation is concurrently applied to all matrices within a single computational kernel. The paper discusses the data layout, kernel design, and optimization techniques. We also propose a design scheme that is centralized around matrix-matrix multiplication (GEMM) kernel, so that any improvement on this particular kernel propagates automatically to other routines. Our performance results show significant speedups using a Pascal generation GPU (Tesla P100) against state-of-the-art solutions using cuBLAS, as well as against two 10-core Haswell CPUs running the MKL library. This work is part of the MAGMA library. CCS CONCEPTS •Computing methodologies →Massively parallel algorithms;

show abstract

Experiences in autotuning matrix multiplication for energy minimization on GPUs

Cited by 19 publications

References 29 publications

Massively Parallel Automated Software Tuning

Massively Parallel Automated Software Tuning

Toward Exascale Computing Systems: An Energy Efficient Massive Parallel Computational Model

Novel HPC techniques to batch execution of many variable size BLAS computations on GPUs

Contact Info

Product

Resources

About