Hierarchical approach for deriving a reproducible unblocked LU factorization

Iakymchuk, Roman; Graillat, Stef; Defour, David; Quintana-Ortí, Enrique S.

doi:10.1177/1094342019832968

Cited by 6 publications

(8 citation statements)

References 24 publications

“…While ExSUM covers wide range of architectures as well as distributed-memory clusters, the other routines primarily target GPUs. Exploiting the modular and hierarchical structure of linear algebra algorithms, the ExBLAS approach was applied to construct reproducible LU factorizations with partial pivoting [8].…”

Section: Related Work (mentioning)

confidence: 99%

Reproducibility strategies for parallel Preconditioned Conjugate Gradient

Iakymchuk, Barreda, Wiesenberger, et al. 2020

Journal of Computational and Applied Mathematics

Self Cite

The Preconditioned Conjugate Gradient method is often used in numerical simulations. While being widely used, the solver is also known for its lack of accuracy while computing the residual. In this article, we aim at a twofold goal: enhance the accuracy of the solver but also ensure its reproducibility in a message-passing implementation. We design and employ various strategies starting from the ExBLAS approach (through preserving every bit of information until final rounding) to its more lightweight performance-oriented variant (through expanding the intermediate precision). These algorithmic strategies are reinforced with programmability suggestions to assure deterministic executions. Finally, we verify these strategies on modern HPC systems: both versions deliver reproducible number of iterations, residuals, direct errors, and vector-solutions for the overhead of only 29 % (ExBLAS) and 4 % (lightweight) on 768 processes.

Section: Related Work (mentioning)

confidence: 99%

“…The later can also be viewed as switching to fixed-precision computations. Additionally, the bit-wise reproducibility can get costly with the overhead of at least 8 % for parallel reduction [6,7], up to 2x-4x for matrix-vector product [8], and more than 10x for matrix-matrix multiplication [9].…”

Section: Introduction (mentioning)

confidence: 99%

Reproducibility strategies for parallel Preconditioned Conjugate Gradient

Iakymchuk, Barreda, Wiesenberger, et al. 2020

Journal of Computational and Applied Mathematics

Self Cite

“…Our basic assumption is that, if these elementary functions are reproducible, then all algorithms and simulations implemented with them are reproducible. This assumption follows our theoretical and practical studies [37] of the unblocked LU factorization with partial pivoting, which underneath is entirely build upon the BLAS routines. The first step to realize our goal incorporates the correctly rounded and reproducible parallel reduction from the ExBLAS library into Feltor.…”

Section: Reproducibility In Feltor (mentioning)

confidence: 99%

Reproducibility, accuracy and performance of the Feltor code and library on parallel computer architectures

Wiesenberger, Einkemmer, Held, et al. 2019

Computer Physics Communications

Self Cite

Feltor is a modular and free scientific software package. It allows developing platform independent code that runs on a variety of parallel computer architectures ranging from laptop CPUs to multi-GPU distributed memory systems. Feltor consists of both a numerical library and a collection of application codes built on top of the library. Its main target are two- and three-dimensional drift- and gyro-fluid simulations with discontinuous Galerkin methods as the main numerical discretization technique. We observe that numerical simulations of a recently developed gyro-fluid model produce non-deterministic results in parallel computations. First, we show how we restore accuracy and bitwise reproducibility algorithmically and programmatically. In particular, we adopt an implementation of the exactly rounded dot product based on long accumulators, which avoids accuracy losses especially in parallel applications. However, reproducibility and accuracy alone fail to indicate correct simulation behaviour. In fact, in the physical model slightly different initial conditions lead to vastly different end states. This behaviour translates to its numerical representation. Pointwise convergence, even in principle, becomes impossible for long simulation times. We briefly discuss alternative methods to ensure the correctness of results like the convergence of reduced physical quantities of interest, ensemble simulations, invariants or reduced simulation times. In a second part, we explore important performance tuning considerations. We identify latency and memory bandwidth as the main performance indicators of our routines. Based on these, we propose a parallel performance model that predicts the execution time of algorithms implemented in Feltor and test our model on a selection of parallel hardware architectures. We are able to predict the execution time with a relative error of less than 25% for problem sizes between 10^-1 and 10^3 MB. Finally, we find that the product of latency and bandwidth gives a minimum array size per compute node to achieve a scaling efficiency above 50% (both strong and weak).

Section: Related Work (mentioning)

confidence: 99%

“…These modifications are necessary to preserve every bit of information (both result and error) Collange et al (2015) or, alternatively, to cut off some parts of the data and operate on the remaining most significant parts Mukunoki et al (2020); Demmel and Nguyen (2015). Furthermore, the bit-wise reproducibility can become expensive with the overhead of at least 8% for parallel reduction Collange et al (2015); Demmel and Nguyen (2015), up to 2x–4x for matrix-vector product Iakymchuk et al (2019b), and more than 10x for matrix–matrix multiplication Iakymchuk et al (2016). In this paper, we aim to revisit reproducibility and raise its appeal through reducing its negative impact on performance and minimizing changes to both the algorithm and its building blocks.…”

Section: Introduction (mentioning)

confidence: 99%

Reproducibility of parallel preconditioned conjugate gradient in hybrid programming environments

Iakymchuk, Barreda, Graillat, et al. 2020

The International Journal of High Performance Computing Applica

Self Cite

The Preconditioned Conjugate Gradient method is often employed for the solution of linear systems of equations arising in numerical simulations of physical phenomena. While being widely used, the solver is also known for its lack of accuracy while computing the residual. In this article, we propose two algorithmic solutions that originate from the ExBLAS project to enhance the accuracy of the solver as well as to ensure its reproducibility in a hybrid MPI + OpenMP tasks programming environment. One is based on ExBLAS and preserves every bit of information until the final rounding, while the other relies upon floating-point expansions and, hence, expands the intermediate precision. Instead of converting the entire solver into its ExBLAS-related implementation, we identify those parts that violate reproducibility/non-associativity, secure them, and combine this with the sequential executions. These algorithmic strategies are reinforced with programmability suggestions to assure deterministic executions. Finally, we verify these approaches on two modern HPC systems: both versions deliver reproducible number of iterations, residuals, direct errors, and vector-solutions for the overhead of less than 37.7% on 768 cores.
