GHOST: Building Blocks for High Performance Sparse Linear Algebra on Heterogeneous Systems

2016
DOI: 10.1007/s10766-016-0464-z

Abstract: While many of the architectural details of future exascale-class high performance computer systems are still a matter of intense research, there appears to be a general consensus that they will be strongly heterogeneous, featuring "standard" as well as "accelerated" resources. Today, such resources are available as multicore processors, graphics processing units (GPUs), and other accelerators such as the Intel Xeon Phi. Any software infrastructure that claims usefulness for such environments must be able to me…

Help me understand this report
View preprint versions

Search citation statements

Order By: Relevance

Paper Sections

Select...
1
1
1
1

Citation Types

0
48
0

Year Published

2016
2016
2022
2022

Publication Types

Select...
6
2

Relationship

6
2

Authors

Journals

citations
Cited by 40 publications
(48 citation statements)
references
References 45 publications
0
48
0
Order By: Relevance
“…6: the application of the polynomial filter. A key feature of our implementation is the use of sparse matrix multiple-vector multiplication (spMMVM) as provided by the GHOST library [29], where the sparse matrix is applied simultaneously to several vectors. As we demonstrated previously for the KPM [6] and a block Jacobi-Davidson algorithm [39], the reduction of memory traffic in spMMVM can lead to significant performance gains over multiple independent spMVMs, where the matrix has to be reloaded from memory repeatedly.…”
Section: Parallel Implementation and Performance Engineering (mentioning)
confidence: 99%
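
The traffic saving described in this excerpt is easy to see in code. Below is a minimal spMMVM sketch in plain C; it is not GHOST's interface (GHOST's native sparse format is SELL-C-σ; plain CRS is used here for brevity), and the names crs_matrix and spmmvm are illustrative. With the block of vectors stored row-major (interleaved), every matrix entry is loaded from memory once and reused for all nb right-hand sides, whereas nb independent spMVMs would stream the whole matrix nb times.

```c
/* Minimal spMMVM sketch (illustrative, not GHOST's API): a CRS matrix
 * applied to a block of nb vectors at once.  X and Y are stored row-major
 * as nrows x nb blocks, so each matrix entry val[j] is read once and
 * reused for all nb right-hand sides. */
#include <stddef.h>

typedef struct {
    int nrows;
    const int *rowptr;    /* size nrows + 1      */
    const int *col;       /* size rowptr[nrows]  */
    const double *val;    /* size rowptr[nrows]  */
} crs_matrix;

/* Y = A * X for a block of nb vectors. */
void spmmvm(const crs_matrix *A, const double *X, double *Y, int nb)
{
    for (int i = 0; i < A->nrows; ++i) {
        for (int b = 0; b < nb; ++b)
            Y[(size_t)i * nb + b] = 0.0;
        for (int j = A->rowptr[i]; j < A->rowptr[i + 1]; ++j) {
            const double a = A->val[j];   /* matrix loaded once ...  */
            const int    c = A->col[j];
            for (int b = 0; b < nb; ++b)  /* ... used for nb vectors */
                Y[(size_t)i * nb + b] += a * X[(size_t)c * nb + b];
        }
    }
}
```

Calling spmmvm once with nb = 4 streams the matrix through memory once; four independent spMVM calls would stream it four times, which is exactly the repeated reload the excerpt refers to.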
“…In Sec. 4 we describe the main performance engineering steps for our GHOST [29]-based ChebFD implementation, which we use for the large-scale application studies in Sec. 5.…”
Section: Introduction (mentioning)
confidence: 99%
“…Recently, block Krylov methods have been receiving increasing attention in the HPC field [4,1,27,36,30]. They appear to be well suited to modern computer architectures with a high level of parallelism, because they make it possible to reduce the number of global synchronizations while also featuring a higher arithmetic intensity, at the cost of some extra computation.…”
Section: Block Krylov Methods (mentioning)
confidence: 99%
“…Given n, t such that t ≪ n, we denote by V, W tall-and-skinny matrices of size n × t whose rows are distributed among the processors, and by α a matrix of size t × t replicated on the P processors. Following [30], it is possible to decompose the iterations of ECG (and more generally block CG) into the following kernels:…”
Section: Cost Analysis of ECG (mentioning)
confidence: 99%
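
The kernel decomposition alluded to in this excerpt can be sketched with two representative kernels in plain C with MPI; the names and layout are assumptions, not the authors' code from [30]. Forming the replicated Gram product α = VᵀW takes exactly one MPI_Allreduce of a t × t block, a single global synchronization that covers all t columns at once (the synchronization saving mentioned in the previous excerpt), while the block update W ← W − Vα is purely local.

```c
/* Sketch of two block-CG kernels (illustrative names, plain C + MPI).
 * V and W hold the nloc x t local row blocks (row-major) of n x t
 * tall-and-skinny matrices; alpha is a replicated t x t matrix. */
#include <mpi.h>
#include <stdlib.h>

/* alpha = V^T * W: local Gram product, then ONE global reduction. */
void gram(const double *V, const double *W, int nloc, int t, double *alpha)
{
    double *loc = calloc((size_t)t * t, sizeof *loc);
    for (int i = 0; i < nloc; ++i)
        for (int p = 0; p < t; ++p)
            for (int q = 0; q < t; ++q)
                loc[p * t + q] += V[(size_t)i * t + p] * W[(size_t)i * t + q];
    MPI_Allreduce(loc, alpha, t * t, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
    free(loc);
}

/* W = W - V * alpha: no communication at all. */
void block_update(double *W, const double *V, const double *alpha,
                  int nloc, int t)
{
    for (int i = 0; i < nloc; ++i)
        for (int q = 0; q < t; ++q) {
            double s = 0.0;
            for (int p = 0; p < t; ++p)
                s += V[(size_t)i * t + p] * alpha[p * t + q];
            W[(size_t)i * t + q] -= s;
        }
}
```

For t = 1 the Gram kernel degenerates to a single dot product with its own reduction; blocking t vectors amortizes the latency of one global reduction over all t columns, at the price of the extra local flops the excerpt mentions.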
“…In addition to the standard types, the CRAFT library was extended with support for GHOST sparse matrix data types [18], PHIST sparse matrix data types [21], and Intel MKL complex data types. These extensions are part of the downloadable code [12].…”
Section: Additional CR Extensions (mentioning)
confidence: 99%
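
As a rough illustration of what such type support involves (a hypothetical sketch, not CRAFT's actual API), a checkpoint/restart layer can treat application objects as opaque and delegate serialization to per-type callbacks; support for a new sparse matrix type then amounts to supplying a write/read pair for it.

```c
/* Hypothetical sketch -- NOT CRAFT's real interface.  A checkpoint layer
 * stores opaque objects via user-supplied callbacks; supporting a new
 * data type means providing a write/read pair for it. */
#include <stdio.h>

typedef int (*cp_write_fn)(const void *obj, FILE *f);
typedef int (*cp_read_fn)(void *obj, FILE *f);

/* Toy fixed-size stand-in for a sparse matrix type. */
typedef struct {
    int nrows, nnz;
    int rowptr[4], col[3];
    double val[3];
} toy_crs;

static int toy_crs_write(const void *obj, FILE *f)
{
    return fwrite(obj, sizeof(toy_crs), 1, f) == 1 ? 0 : -1;
}

static int toy_crs_read(void *obj, FILE *f)
{
    return fread(obj, sizeof(toy_crs), 1, f) == 1 ? 0 : -1;
}

/* The checkpoint layer sees only an opaque pointer plus a callback. */
static int cp_save(const void *obj, cp_write_fn w, const char *path)
{
    FILE *f = fopen(path, "wb");
    if (!f) return -1;
    int rc = w(obj, f);
    fclose(f);
    return rc;
}

static int cp_restore(void *obj, cp_read_fn r, const char *path)
{
    FILE *f = fopen(path, "rb");
    if (!f) return -1;
    int rc = r(obj, f);
    fclose(f);
    return rc;
}

int main(void)
{
    toy_crs A = { 3, 3, {0, 1, 2, 3}, {0, 1, 2}, {1.0, 2.0, 3.0} }, B;
    if (cp_save(&A, toy_crs_write, "matrix.cp")) return 1;
    return cp_restore(&B, toy_crs_read, "matrix.cp") ? 1 : 0;
}
```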