2019
DOI: 10.1177/1094342019832968

Hierarchical approach for deriving a reproducible unblocked LU factorization

Abstract: We propose a reproducible variant of the unblocked LU factorization for graphics processing units (GPUs). For this purpose, we build upon Level-1/2 BLAS kernels that deliver correctly rounded and reproducible results for the dot (inner) product, vector scaling, and the matrix-vector product. In addition, we devise a strategy to enhance the accuracy of the triangular solve via iterative refinement. Following a bottom-up approach, we finally construct a reproducible unblocked implementation of the LU factorization f…
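For orientation, here is a minimal sketch of the kind of unblocked, right-looking LU factorization with partial pivoting the abstract describes, written in plain NumPy rather than the authors' GPU/ExBLAS code. The function name unblocked_lu is hypothetical; the point is that every inner step reduces to the Level-1/2 BLAS kernels named above (pivot search, vector scaling, and a rank-1 update formed from vector products), so swapping those kernels for correctly rounded, reproducible counterparts would make the whole factorization reproducible.

```python
import numpy as np

def unblocked_lu(A):
    """Right-looking unblocked LU with partial pivoting.

    Each step maps onto a Level-1/2 BLAS kernel: pivot search
    (iamax), column scaling (scal), and a rank-1 trailing update
    (ger). Returns the packed factors (unit-lower L and U share
    one matrix) plus the row permutation.
    """
    A = np.array(A, dtype=np.float64)  # work on a copy
    n = A.shape[0]
    piv = np.arange(n)
    for k in range(n - 1):
        # Pivot search over column k (BLAS iamax).
        p = k + np.argmax(np.abs(A[k:, k]))
        if p != k:
            A[[k, p], :] = A[[p, k], :]
            piv[[k, p]] = piv[[p, k]]
        # Column scaling (BLAS scal): multipliers below the pivot.
        A[k + 1:, k] /= A[k, k]
        # Rank-1 update of the trailing submatrix (BLAS ger).
        A[k + 1:, k + 1:] -= np.outer(A[k + 1:, k], A[k, k + 1:])
    return A, piv

# Quick check: the permuted input equals L times U.
A0 = np.random.rand(4, 4)
LU, piv = unblocked_lu(A0)
L = np.tril(LU, -1) + np.eye(4)
U = np.triu(LU)
assert np.allclose(A0[piv], L @ U)
```

Note that the pivot choice itself depends on the computed entries, so without reproducible kernels a run-to-run change in rounding can alter the permutation, not just the last bits of the factors; this is why the paper builds the factorization bottom-up from reproducible building blocks.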

Cited by 6 publications (8 citation statements)
References 24 publications
“…While ExSUM covers a wide range of architectures as well as distributed-memory clusters, the other routines primarily target GPUs. Exploiting the modular and hierarchical structure of linear algebra algorithms, the ExBLAS approach was applied to construct reproducible LU factorizations with partial pivoting [8].…”
Section: Related Work
confidence: 99%
“…The latter can also be viewed as switching to fixed-precision computations. Additionally, bit-wise reproducibility can become costly, with an overhead of at least 8% for parallel reduction [6,7], up to 2x-4x for the matrix-vector product [8], and more than 10x for matrix-matrix multiplication [9].…”
Section: Introduction
confidence: 99%
“…Our basic assumption is that, if these elementary functions are reproducible, then all algorithms and simulations implemented with them are reproducible. This assumption follows our theoretical and practical studies [37] of the unblocked LU factorization with partial pivoting, which underneath is entirely built upon the BLAS routines. The first step to realize our goal incorporates the correctly rounded and reproducible parallel reduction from the ExBLAS library into Feltor.…”
Section: Reproducibility in Feltor
confidence: 99%
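As a hedged illustration of the property this citing work imports from ExBLAS into Feltor, the sketch below shows why a correctly rounded reduction is reproducible regardless of summation order: the inputs are accumulated exactly (here with Python rationals, a stand-in for ExBLAS's long-accumulator technique, not its implementation) and rounded to double only once at the end. The name reproducible_sum is hypothetical.

```python
from fractions import Fraction
import random

def reproducible_sum(values):
    """Order-independent, correctly rounded sum of doubles.

    Every double converts exactly to a rational, the exact sum is
    formed, and a single rounding back to double is applied at the
    end. Because the accumulation is exact, any summation order
    (or parallel partitioning) yields bitwise-identical results.
    """
    exact = sum((Fraction(v) for v in values), Fraction(0))
    return float(exact)  # one correctly rounded conversion

# Reordering the inputs never changes the result, unlike naive
# floating-point summation, where the order changes the rounding.
data = [random.uniform(-1e16, 1e16) for _ in range(1000)]
shuffled = data[:]
random.shuffle(shuffled)
assert reproducible_sum(data) == reproducible_sum(shuffled)
```

ExSUM achieves the same order independence with a fixed-point superaccumulator wide enough to capture every bit of every addend, which is far faster than rational arithmetic but rests on the same principle: accumulate exactly, round once.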
“…While ExSUM covers a wide range of architectures as well as distributed-memory clusters, the other routines primarily target GPUs. Exploiting the modular and hierarchical structure of linear algebra algorithms, the ExBLAS approach was applied to construct reproducible LU factorizations with partial pivoting (Iakymchuk et al., 2019b).…”
Section: Related Work
confidence: 99%
“…These modifications are necessary to preserve every bit of information (both result and error) (Collange et al., 2015) or, alternatively, to cut off some parts of the data and operate on the remaining most significant parts (Mukunoki et al., 2020; Demmel and Nguyen, 2015). Furthermore, bit-wise reproducibility can become expensive, with an overhead of at least 8% for parallel reduction (Collange et al., 2015; Demmel and Nguyen, 2015), up to 2x–4x for the matrix-vector product (Iakymchuk et al., 2019b), and more than 10x for matrix–matrix multiplication (Iakymchuk et al., 2016). In this paper, we aim to revisit reproducibility and raise its appeal by reducing its negative impact on performance and minimizing changes to both the algorithm and its building blocks.…”
Section: Introduction
confidence: 99%