Summary

We present a new high-performance framework for dense triangular Basic Linear Algebra Subroutines (BLAS) kernels, i.e., triangular matrix-matrix multiplication (TRMM) and triangular solve (TRSM), on various manycore architectures. This is an extension of previous work on a single GPU by the same authors, presented at the Euro-Par'16 conference, in which we demonstrated the effectiveness of recursive formulations in enhancing the performance of these kernels. In this paper, the performance of triangular BLAS kernels on a single GPU is further enhanced by implementing customized in-place CUDA kernels for TRMM and TRSM, which are called at the bottom of the recursion. In addition, a multi-GPU implementation of TRMM and TRSM is proposed, and we show an almost linear performance scaling as the number of GPUs increases. Finally, the algorithmic recursive formulation of these triangular BLAS kernels is, in fact, oblivious to the targeted hardware architecture. We, therefore, port these recursive kernels to homogeneous manycore architectures as well.
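To make the recursive formulation concrete, the following is a minimal host-side sketch of a recursive left-lower non-transpose TRSM over column-major device matrices, assuming a valid cuBLAS handle. The base case falls back to cublasDtrsm rather than the customized in-place kernels described above, and the crossover size REC_TRSM_STOP is an illustrative placeholder, not a value from the paper.

// Sketch: recursive TRSM (left, lower, non-transpose, non-unit diagonal),
// double precision, column-major device matrices, synchronous cuBLAS calls.
#include <cublas_v2.h>

static const int REC_TRSM_STOP = 512;  // hypothetical base-case crossover size

// Solve A * X = alpha * B, where A is m x m lower triangular and B is m x n;
// the solution X overwrites B.
void recTRSM_LLN(cublasHandle_t handle, int m, int n, double alpha,
                 const double* A, int lda, double* B, int ldb)
{
    if (m <= REC_TRSM_STOP) {
        // Base case: defer to the vendor TRSM (or a customized in-place kernel).
        cublasDtrsm(handle, CUBLAS_SIDE_LEFT, CUBLAS_FILL_MODE_LOWER,
                    CUBLAS_OP_N, CUBLAS_DIAG_NON_UNIT,
                    m, n, &alpha, A, lda, B, ldb);
        return;
    }
    int m1 = m / 2;       // size of the top diagonal block A11
    int m2 = m - m1;      // size of the bottom diagonal block A22

    // X1 = A11^{-1} * (alpha * B1): recurse on the top block.
    recTRSM_LLN(handle, m1, n, alpha, A, lda, B, ldb);

    // B2 := alpha * B2 - A21 * X1: one large GEMM carries most of the flops.
    const double minus_one = -1.0;
    cublasDgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, m2, n, m1,
                &minus_one, A + m1, lda, B, ldb, &alpha, B + m1, ldb);

    // X2 = A22^{-1} * B2: recurse on the bottom block (alpha already applied).
    recTRSM_LLN(handle, m2, n, 1.0, A + m1 + (size_t)m1 * lda, lda, B + m1, ldb);
}

The point of the recursion is that the off-diagonal update is expressed as a single large GEMM per level, which is the operation manycore hardware executes most efficiently; the same splitting idea carries over to TRMM and to the other side/uplo/transpose cases.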