2017
DOI: 10.1002/cpe.4280
Communication in task‐parallel ILU‐preconditioned CG solvers using MPI + OmpSs

Abstract: We target the parallel solution of sparse linear systems via iterative Krylov subspace-based methods enhanced with ILU-type preconditioners on clusters of multicore processors. In order to tackle large-scale problems, we develop task-parallel implementations of the classical iteration for the CG method, accelerated via ILUPACK and ILU(0) preconditioners, using MPI+OmpSs. In addition, we integrate several communication-avoiding (CA) strategies into the codes, including the butterfly communication scheme and Eijk…

Cited by 5 publications (8 citation statements). References 19 publications.
“…Code availability. Our solvers utilize functionality from the following libraries: ILUPACK (http://ilupack.tu-bs.de, Bollhöfer, 2020), PARDISO (https://www.pardiso-project.org, Davis et al, 2016), and PETSc (https://www.mcs.anl.gov/petsc, Balay et al, 2019a). PETSc is open source under a BSD-2 license; ILUPACK and PARDISO are closed source and offer complementary academic licenses.…”
Section: Discussion
confidence: 99%
“…This level is exploited in each node of the cluster using, for example, OpenMP. The analysis in Aliaga et al (2017); Barreda et al (2019) shows that, in the PCG, a reasonable option is to leverage task-parallelism, which consists of dividing each kernel into a collection of finer-grain operations, or tasks. Then, each thread executes a different task and two consecutive kernels can be executed concurrently, avoiding a thread-synchronization point after each kernel, as described next.…”
Section: Algorithm(s)
confidence: 99%
“…Our work builds upon a number of previous papers that address the task-parallel implementation of KSMs on multicore architectures and clusters of multicore processors. First, the authors of [1] proposed a parallel implementation of a CG solver, enhanced with a sophisticated ILUPACK preconditioner, that leverages MPI and OmpSs [13,14] to improve the performance of a pure MPI-based solution; this approach was then generalized to other types of incomplete LU (ILU)-based preconditioners and communication-reduced variants of CG in [2]. Independently, the authors of [18] presented an iteration-fusing variant of the pipelined CG [9], for multicore processors, that combines a task-parallel re-formulation of the method with a relaxation of the convergence test in order to break the strict barrier between consecutive iterations of the method.…”
Section: Introduction
confidence: 99%