2020
DOI: 10.48550/arxiv.2011.00715
Preprint

Toward Performance-Portable PETSc for GPU-based Exascale Systems

Cited by 4 publications (7 citation statements)
References 15 publications
“…For the preconditioning step, we consider a smoothed aggregation algebraic multigrid method constructed on the matrix C, using a diagonally preconditioned Chebyshev method as a smoother. The setup of the preconditioner runs partly on the CPU and partly on the GPU, while the Krylov solver, including the preconditioner application, runs entirely on the GPU [42,56].…”
Section: Strong Scalability
confidence: 99%
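
For context, the solver configuration described in this excerpt maps onto standard PETSc runtime options. The following is a sketch, not the citing authors' exact settings; the executable name is hypothetical, but the option names are real PETSc options (GAMG smoothed aggregation, Jacobi-preconditioned Chebyshev smoothers, CUDA matrix and vector types):

    ./app -ksp_type cg \
          -pc_type gamg -pc_gamg_type agg \
          -mg_levels_ksp_type chebyshev \
          -mg_levels_pc_type jacobi \
          -mat_type aijcusparse -vec_type cuda

This matches the excerpt's description: with the aijcusparse/cuda backends, the multigrid setup still performs part of its work on the CPU, while the Krylov iteration and the preconditioner application run on the GPU.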
“…A symmetric address is the address of a symmetric object on the local PE, plus an offset if needed. The code below allocates two symmetric double arrays src[1] and dst[2], and every PE puts a double from its src[0] to the next PE's dst[1].…”
Section: Stream-aware NVSHMEM Support
confidence: 99%
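
The listing this quote refers to did not survive extraction. The sketch below reconstructs something consistent with the description using the standard NVSHMEM host API (nvshmem_malloc, nvshmem_double_put, nvshmem_barrier_all); the program scaffolding and device assignment are our assumptions, not the paper's original code.

    /* Minimal sketch, not the paper's original listing: every PE puts one
     * double from its src[0] into dst[1] on the next PE. */
    #include <cuda_runtime.h>
    #include <nvshmem.h>

    int main(void) {
      nvshmem_init();
      int mype = nvshmem_my_pe();
      int npes = nvshmem_n_pes();
      int ndev;
      cudaGetDeviceCount(&ndev);
      cudaSetDevice(mype % ndev);  /* one GPU per PE (assumed layout) */

      /* Symmetric allocations: the same call on every PE, so src and dst
       * are symmetric objects and their addresses are symmetric. */
      double *src = (double *)nvshmem_malloc(1 * sizeof(double)); /* src[1] */
      double *dst = (double *)nvshmem_malloc(2 * sizeof(double)); /* dst[2] */

      double v = (double)mype;
      cudaMemcpy(src, &v, sizeof(double), cudaMemcpyHostToDevice);

      /* dst + 1 is a symmetric address: the symmetric object dst plus an
       * offset of one element, resolved on the destination PE. */
      nvshmem_double_put(dst + 1, src, 1, (mype + 1) % npes);
      nvshmem_barrier_all();

      nvshmem_free(src);
      nvshmem_free(dst);
      nvshmem_finalize();
      return 0;
    }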
“…In [1] we discuss the plans and progress in adapting the Portable, Extensible Toolkit for Scientific Computation and Toolkit for Advanced Optimization [2] (PETSc) to CPU-GPU systems. This paper focuses specifically on the plans for managing the network and intra-node communication within PETSc.…”
Section: Introduction
confidence: 99%
“…Iterative linear solvers are often preferred for solving large-scale linear systems, as they can take advantage of problem structure such as sparsity or bandedness, require inexpensive floating point operations, and can be readily paired with preconditioning techniques [19, see preface]. While such iterative linear solvers as Conjugate Gradients (CG) and the Generalized Minimal Residual method (GMRES) are still dominant solvers in practice, randomized row-action [8,1,14,23] and column-action iterative solvers [10,25] have been growing in interest for several reasons: they (usually) require very few floating point operations per iteration [5,3]; they have low-memory footprints [9]; they can readily be composed with randomization techniques to quickly produce approximate solutions [23,10,24,6,11,2,7,17]; they can be used for solving systems constructed in a streaming fashion (e.g., [15]), which supports emerging computing paradigms (e.g., [13]); and, just like the more popular iterative Krylov solvers, they can be parallelized, preconditioned or combined with other linear solvers [20,16,4,18];…”
Section: Introduction
confidence: 99%
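
As a concrete instance of the row-action idea mentioned in this excerpt, the sketch below implements randomized Kaczmarz, the classical randomized row-action solver; it is our illustration, not code from the cited paper. Each iteration samples one row a_i of A (with probability proportional to ||a_i||^2) and projects the iterate onto the hyperplane a_i^T x = b_i, so both the per-iteration cost and the memory footprint are small.

    /* Randomized Kaczmarz: a minimal row-action solver sketch. */
    #include <stdio.h>
    #include <stdlib.h>

    static void randomized_kaczmarz(int m, int n, const double *A,
                                    const double *b, double *x, int iters) {
      double *norms2 = malloc((size_t)m * sizeof(double));
      double total = 0.0;
      for (int i = 0; i < m; i++) {           /* row norms ||a_i||^2 */
        double s = 0.0;
        for (int j = 0; j < n; j++) s += A[i*n+j] * A[i*n+j];
        norms2[i] = s;
        total += s;
      }
      for (int k = 0; k < iters; k++) {
        /* Sample row i with probability ||a_i||^2 / ||A||_F^2. */
        double r = total * rand() / RAND_MAX, acc = 0.0;
        int i = m - 1;
        for (int t = 0; t < m; t++) { acc += norms2[t]; if (r <= acc) { i = t; break; } }
        /* Project the iterate onto the hyperplane a_i^T x = b_i. */
        double dot = 0.0;
        for (int j = 0; j < n; j++) dot += A[i*n+j] * x[j];
        double step = (b[i] - dot) / norms2[i];
        for (int j = 0; j < n; j++) x[j] += step * A[i*n+j];
      }
      free(norms2);
    }

    int main(void) {
      enum { M = 8, N = 3 };
      double A[M*N], b[M], xt[N] = {1.0, -2.0, 0.5}, x[N] = {0};
      for (int i = 0; i < M*N; i++) A[i] = (double)rand() / RAND_MAX - 0.5;
      for (int i = 0; i < M; i++) {           /* consistent system b = A*xt */
        b[i] = 0.0;
        for (int j = 0; j < N; j++) b[i] += A[i*N+j] * xt[j];
      }
      randomized_kaczmarz(M, N, A, b, x, 2000);
      printf("x = %.4f %.4f %.4f (target 1, -2, 0.5)\n", x[0], x[1], x[2]);
      return 0;
    }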