2015
DOI: 10.1002/cpe.3516
Experiences in autotuning matrix multiplication for energy minimization on GPUs

Abstract: In this paper, we report extensive results and analysis of autotuning the computationally intensive graphics processing unit (GPU) kernel for dense matrix-matrix multiplication in double precision. In contrast to traditional autotuning and/or optimization for runtime performance only, we also take energy efficiency into account. For kernels achieving equal performance, we show significant differences in their energy balance. We also identify the memory throughput as the most influential metric that trades off p…


Cited by 19 publications (10 citation statements)
References 29 publications
“…Consider the architecture of the number one system on the list, the Summit supercomputer at the Oak Ridge Leadership Computing Facility (OLCF). Summit contains three NVIDIA V100 GPUs per POWER9 CPU. The peak double-precision floating-point performance of the CPU is 22 (cores) × 24.56 GFLOPS = 540.32 GFLOPS.…”
Section: Motivation
Confidence: 99%
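The arithmetic in the citation statement is easy to reproduce. The per-core figure (24.56 GFLOPS) and the core count (22) come from the quote; the V100 peak (~7.8 TFLOPS FP64 for the SXM2 part) is an assumed round number used only to put the CPU figure in context:

```python
# Back-of-the-envelope peak double-precision FLOPS for one Summit CPU socket
# and its attached GPUs, following the cited statement.
P9_CORES = 22                 # POWER9 cores, from the citation statement
P9_GFLOPS_PER_CORE = 24.56    # from the citation statement
GPUS_PER_CPU = 3              # Summit pairs three V100s with each POWER9
V100_FP64_GFLOPS = 7800.0     # assumption: NVIDIA V100 (SXM2) peak FP64

cpu_peak = P9_CORES * P9_GFLOPS_PER_CORE       # 540.32 GFLOPS, as quoted
gpu_peak = GPUS_PER_CPU * V100_FP64_GFLOPS

print(f"CPU peak: {cpu_peak:.2f} GFLOPS")
print(f"GPU share of socket peak: {gpu_peak / (cpu_peak + gpu_peak):.1%}")
```

The point of the citing paper's motivation follows directly: under these assumptions, the GPUs account for well over 95% of the socket's double-precision peak.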
“…While in the previous sections we used the simplistic gemm kernel from the CUDA Programming Guide, here we are using a highly parametrized gemm kernel developed in the course of our past autotuning efforts [2,9,10]. It is based on a fairly standard approach, where values of C are accumulated in registers, while values of A and B are streamed through shared memory in thin stripes.…”
Section: Kernel
Confidence: 99%
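The blocking structure described above — tiles of C accumulated in registers while thin stripes of A and B pass through shared memory — can be mimicked in plain NumPy as a blocked loop. This is a sketch of the general technique, not the authors' tuned CUDA kernel; the tile sizes are illustrative stand-ins for the parameters their autotuner would search over:

```python
import numpy as np

def blocked_gemm(A, B, tile_m=4, tile_n=4, tile_k=8):
    """Blocked GEMM sketch: each (tile_m, tile_n) tile of C plays the role
    of a thread block's register accumulator; each (tile_m, tile_k) stripe
    of A and (tile_k, tile_n) stripe of B plays the role of a shared-memory
    stripe streamed through the k dimension. Tile sizes are illustrative."""
    m, k = A.shape
    k2, n = B.shape
    assert k == k2, "inner dimensions must match"
    C = np.zeros((m, n))
    for i in range(0, m, tile_m):
        for j in range(0, n, tile_n):
            # the "register" accumulator for this tile of C
            acc = np.zeros((min(tile_m, m - i), min(tile_n, n - j)))
            for p in range(0, k, tile_k):
                a_stripe = A[i:i + tile_m, p:p + tile_k]  # stripe of A
                b_stripe = B[p:p + tile_k, j:j + tile_n]  # stripe of B
                acc += a_stripe @ b_stripe                # accumulate
            C[i:i + tile_m, j:j + tile_n] = acc
    return C
```

In the real kernel, the tile and stripe sizes are exactly the parameters being autotuned, since they control both occupancy and memory throughput.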
“…4. Based on performance and energy efficiency, a tradeoff between the two metrics [47] can be determined as follows:…”
Confidence: 99%
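The excerpt cuts off before the cited paper's actual formula. A commonly used way to combine the two metrics is the energy-delay product family; the sketch below is offered as an assumption, not as the citing paper's own definition:

```python
def energy_delay_product(time_s, energy_j, w=1.0):
    """Energy-delay product family: EDP = E * t**w. w=1 is the classic EDP;
    w=2 (ED^2P) weights runtime more heavily. A standard tradeoff metric,
    shown here as an assumption -- the excerpt truncates before the cited
    paper's formula."""
    return energy_j * time_s ** w

# Two kernel variants with equal runtime but different energy balance:
print(energy_delay_product(0.5, 100.0))  # 50.0
print(energy_delay_product(0.5, 80.0))   # 40.0 -- more energy-efficient
```

Lower is better under this metric, so among equally fast kernels the one drawing less energy wins — exactly the situation the surveyed paper reports for its autotuned GEMM variants.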
“…An important aspect of the proposed design is the reliance on the matrix-matrix multiplication (GEMM) kernel in almost all BLAS routines. This approach not only reduces the code base required for development, but also achieves high performance for all routines, thanks to the GEMM kernel being a common target for continuing research, development, and tuning [21,13,3]. In fact, any performance improvement made to the GEMM kernel automatically propagates to almost all other routines in Level-3 BLAS.…”
Section: Introduction
Confidence: 99%
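The design pattern described — funneling other Level-3 BLAS routines through GEMM — can be sketched with a symmetric rank-k update (SYRK). This is an illustrative decomposition, not the cited implementation; the block size `nb` is a hypothetical tuning parameter:

```python
import numpy as np

def gemm(alpha, A, B, beta, C):
    """The single building block: C <- alpha * A @ B + beta * C (in place)."""
    C *= beta
    C += alpha * (A @ B)
    return C

def syrk_lower(A, C, nb=2):
    """Sketch: lower-triangular SYRK update C <- A @ A.T + C, expressed
    entirely as GEMM calls on nb-sized block rows. Illustrates how a
    Level-3 routine can route its flops through GEMM, so any GEMM speedup
    propagates here automatically. Block size nb is illustrative."""
    n = A.shape[0]
    for i in range(0, n, nb):
        for j in range(0, i + nb, nb):  # only blocks on or below the diagonal
            gemm(1.0, A[i:i + nb, :], A[j:j + nb, :].T,
                 1.0, C[i:i + nb, j:j + nb])
    return C
```

Because every flop passes through `gemm`, replacing that one function with a faster kernel speeds up the whole routine — the propagation effect the quoted introduction describes.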