2016
DOI: 10.1007/978-3-319-43659-3_35

Redesigning Triangular Dense Matrix Computations on GPUs

Abstract: New implementations of the triangular matrix-matrix multiplication (TRMM) and triangular solve (TRSM) kernels are described for GPU hardware accelerators. Although part of the Level 3 BLAS family, these highly compute-intensive kernels fail to achieve the fraction of theoretical peak performance on GPUs that one would expect from kernels with a similar surface-to-volume ratio, i.e., the standard matrix-matrix multiplication (GEMM). The authors propose a…
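
(For the square case, the standard Level 3 BLAS operation counts make the "similar surface-to-volume ratio" claim concrete; these are textbook counts, not figures taken from the paper.)

\[
\text{GEMM: } 2n^3 \text{ flops on } \sim 3n^2 \text{ elements}, \qquad
\text{TRMM/TRSM: } n^3 \text{ flops on } \sim \tfrac{3}{2} n^2 \text{ elements},
\]

so both perform \(\Theta(n)\) flops per element touched, which is why GEMM-like efficiency would be expected.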

Cited by 10 publications (15 citation statements, 2017–2022) · References 12 publications
“…To remedy these losses, current high-performance implementations rely on an out-of-place (OOP) design of these aforementioned kernels, which helps expose more parallelism by weakening the thread synchronizations encountered during the Write After Read (WAR) and Read After Write (RAW) data dependency hazards for the TRMM and TRSM kernels, respectively. In particular, in Charara et al 4 we addressed the resulting drawbacks of the OOP design, as opposed to the in-place (IP) design, in the context of a single-GPU platform: (1) extra memory allocation, limiting the problem sizes achievable within scarce memory resources, especially on GPUs; (2) extra data movement, causing extra data transfer time; (3) inefficient use of caches, which must serve one extra matrix, thus increasing cache misses; and, most importantly, (4) violation of the standard legacy BLAS API.…”
Section: Introduction
confidence: 59%
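
To make the IP-versus-OOP contrast concrete, here is a minimal sketch against the cuBLAS interface; the buffer names dA, dB, dC and the wrapper function are illustrative, not from the cited work. TRMM is the one routine where cuBLAS deliberately departs from the in-place BLAS standard.

#include <cublas_v2.h>

// Illustrative wrapper; dA, dB, dC are assumed device buffers,
// column-major, double precision.
void trmm_interfaces(cublasHandle_t h, int m, int n,
                     const double* dA, int lda,  // m x m lower-triangular A
                     double* dB, int ldb,        // m x n matrix B
                     double* dC, int ldc)        // extra m x n output buffer
{
    const double alpha = 1.0;

    // cuBLAS exposes an out-of-place TRMM, C <- alpha * A * B: the extra
    // buffer dC is exactly drawback (1) above, and filling it is the extra
    // traffic of drawbacks (2) and (3).
    cublasDtrmm(h, CUBLAS_SIDE_LEFT, CUBLAS_FILL_MODE_LOWER,
                CUBLAS_OP_N, CUBLAS_DIAG_NON_UNIT,
                m, n, &alpha, dA, lda, dB, ldb, dC, ldc);

    // The legacy BLAS API is in-place, B <- alpha * A * B; with cuBLAS this
    // corresponds to passing the same pointer for B and C. Drawback (4) is
    // that the signature itself no longer matches the standard:
    // cublasDtrmm(h, ..., m, n, &alpha, dA, lda, dB, ldb, dB, ldb);
}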
“…KAUST BLAS (KBLAS) is an open-source library that provides highly optimized implementations for a subset of BLAS routines on NVIDIA GPUs as well as x86 architectures. In particular, the authors have already demonstrated significant performance gains for IP TRSM and TRMM against the cuBLAS IP and MAGMA OOP implementations on a single NVIDIA GPU. They use a recursive formulation of TRSM and TRMM that converts most of the computations into GEMM operations, while optimizing the data access pattern.…”
Section: Related Work
confidence: 99%
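
The recursive formulation reads naturally as divide-and-conquer over the triangular factor: split A into quadrants, solve on the top-left block, update the remaining right-hand sides with one large GEMM, and recurse on the bottom-right block. The sketch below illustrates this idea for the left-sided, lower-triangular, non-transposed case on top of cuBLAS; it is an assumed reconstruction of the technique, not the KBLAS code, and the crossover size BASE is a hypothetical tuning parameter.

#include <cublas_v2.h>

static const int BASE = 128;  // assumed switch-over size; needs tuning

// Solve A * X = alpha * B for X, overwriting B with X.
// A is m x m lower triangular, B is m x n; column-major device buffers.
void rec_trsm(cublasHandle_t h, int m, int n, const double* alpha,
              const double* dA, int lda, double* dB, int ldb)
{
    if (m <= BASE) {
        // Base case: fall back to the vendor in-place TRSM.
        cublasDtrsm(h, CUBLAS_SIDE_LEFT, CUBLAS_FILL_MODE_LOWER,
                    CUBLAS_OP_N, CUBLAS_DIAG_NON_UNIT,
                    m, n, alpha, dA, lda, dB, ldb);
        return;
    }
    int m1 = m / 2, m2 = m - m1;
    const double one = 1.0, neg_one = -1.0;

    // X1 = alpha * A11^{-1} * B1: recurse on the top-left triangle.
    rec_trsm(h, m1, n, alpha, dA, lda, dB, ldb);

    // B2 <- alpha * B2 - A21 * X1: the bulk of the flops is one GEMM.
    cublasDgemm(h, CUBLAS_OP_N, CUBLAS_OP_N, m2, n, m1,
                &neg_one, dA + m1, lda, dB, ldb, alpha, dB + m1, ldb);

    // X2 = A22^{-1} * B2: recurse on the bottom-right triangle.
    rec_trsm(h, m2, n, &one,
             dA + m1 + (size_t)m1 * lda, lda, dB + m1, ldb);
}

Halving the triangle at each level funnels all but the small base-case triangles into GEMM calls, which is what lets an in-place routine approach GEMM-like throughput.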