Sparse matrix-vector multiplication (spMVM) is the most time-consuming kernel in many numerical algorithms and has been studied extensively on all modern processor and accelerator architectures. However, the optimal sparse matrix data storage format is highly hardware-specific, which could become an obstacle when using heterogeneous systems. Also, it is as yet unclear how the wide single instruction multiple data (SIMD) units in current multi- and many-core processors should be used most efficiently if there is no structure in the sparsity pattern of the matrix. We suggest SELL-C-σ, a variant of Sliced ELLPACK, as a SIMD-friendly data format which combines long-standing ideas from general-purpose graphics processing units and vector computer programming. We discuss the advantages of SELL-C-σ compared to established formats like Compressed Row Storage and ELLPACK and show its suitability on a variety of hardware platforms (Intel Sandy Bridge, Intel Xeon Phi, and Nvidia Tesla K20) for a wide range of test matrices from different application areas. Using appropriate performance models we develop deep insight into the data transfer properties of the SELL-C-σ spMVM kernel. SELL-C-σ comes with two tuning parameters whose performance impact across the range of test matrices is studied and for which reasonable choices are proposed. This leads to a hardware-independent ("catch-all") sparse matrix format, which achieves very high efficiency for all test matrices across all hardware platforms.
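The following sketch in plain C illustrates the kind of data layout the SELL-C-σ abstract above describes; it is not the paper's reference implementation, and all names (sell_c_sigma, chunk_ptr, spmv_sell) are illustrative. Rows are grouped into chunks of height C, rows are sorted by length within a scope of σ rows so that each chunk holds rows of similar length, and each chunk is padded to its longest row and stored column-major, so that the innermost loop runs over the C rows of a chunk and maps directly onto SIMD lanes.

/* Illustrative SELL-C-sigma layout and spMVM kernel (hypothetical names,
 * not the paper's reference implementation). Padded entries carry a zero
 * value and a valid column index (e.g. 0) so they contribute nothing. */
typedef struct {
    int     nrows;      /* padded number of rows (multiple of C)            */
    int     C;          /* chunk height, e.g. the SIMD width                */
    int     nchunks;    /* nrows / C                                        */
    int    *chunk_ptr;  /* start of each chunk in val[] / col[]             */
    int    *chunk_len;  /* padded row length of each chunk                  */
    double *val;        /* nonzeros, chunk by chunk, column-major per chunk */
    int    *col;        /* column indices, same layout as val               */
} sell_c_sigma;

/* y = A*x; the inner i-loop vectorizes over the C rows of a chunk. */
void spmv_sell(const sell_c_sigma *A, const double *x, double *y)
{
    for (int c = 0; c < A->nchunks; ++c) {
        int base = A->chunk_ptr[c];
        for (int i = 0; i < A->C; ++i)
            y[c * A->C + i] = 0.0;
        for (int j = 0; j < A->chunk_len[c]; ++j) {
            for (int i = 0; i < A->C; ++i) {          /* SIMD over chunk rows */
                int idx = base + j * A->C + i;
                y[c * A->C + i] += A->val[idx] * x[A->col[idx]];
            }
        }
    }
}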
We study Chebyshev filter diagonalization as a tool for the computation of many interior eigenvalues of very large sparse symmetric matrices. In this technique the subspace projection onto the target space of wanted eigenvectors is approximated with filter polynomials obtained from Chebyshev expansions of window functions. After the discussion of the conceptual foundations of Chebyshev filter diagonalization we analyze the impact of the choice of the damping kernel, search space size, and filter polynomial degree on the computational accuracy and effort, before we describe the necessary steps towards a parallel high-performance implementation. Because Chebyshev filter diagonalization avoids the need for matrix inversion it can deal with matrices and problem sizes that are presently not accessible with rational function methods based on direct or iterative linear solvers. To demonstrate the potential of Chebyshev filter diagonalization for large-scale problems of this kind we include as an example the computation of the 10^2 innermost eigenpairs of a topological insulator matrix with dimension 10^9 derived from quantum physics applications.
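As an illustration of the core computational kernel behind this technique, the following sketch (plain C, hypothetical function names) applies a Chebyshev filter polynomial p(A)v = sum_{k=0..d} c_k T_k(A)v to a single vector via the standard three-term recurrence. It assumes the coefficients c_k already include the window expansion and the damping-kernel factors, the matrix has been scaled and shifted so its spectrum lies in [-1, 1], and spmv() stands for any sparse matrix-vector product (for example the SELL-C-σ kernel sketched above).

#include <stdlib.h>
#include <string.h>

/* y = A*x, provided elsewhere; A is an opaque handle here. */
void spmv(int n, const void *A, const double *x, double *y);

/* w = p(A) v with p of degree `degree` (degree >= 1 assumed). */
void cheb_filter(int n, const void *A, const double *c, int degree,
                 const double *v, double *w)
{
    double *t_prev = malloc(n * sizeof(double));   /* T_{k-1}(A) v */
    double *t_cur  = malloc(n * sizeof(double));   /* T_k(A) v     */
    double *t_next = malloc(n * sizeof(double));

    memcpy(t_prev, v, n * sizeof(double));         /* T_0(A) v = v   */
    spmv(n, A, v, t_cur);                          /* T_1(A) v = A v */

    for (int i = 0; i < n; ++i)
        w[i] = c[0] * t_prev[i] + c[1] * t_cur[i];

    for (int k = 2; k <= degree; ++k) {
        spmv(n, A, t_cur, t_next);                 /* A * T_{k-1}(A) v      */
        for (int i = 0; i < n; ++i) {
            t_next[i] = 2.0 * t_next[i] - t_prev[i]; /* T_k = 2 A T_{k-1} - T_{k-2} */
            w[i]     += c[k] * t_next[i];
        }
        double *tmp = t_prev; t_prev = t_cur; t_cur = t_next; t_next = tmp;
    }
    free(t_prev); free(t_cur); free(t_next);
}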
While many of the architectural details of future exascale-class high performance computer systems are still a matter of intense research, there appears to be a general consensus that they will be strongly heterogeneous, featuring "standard" as well as "accelerated" resources. Today, such resources are available as multicore processors, graphics processing units (GPUs), and other accelerators such as the Intel Xeon Phi. Any software infrastructure that claims usefulness for such environments must be able to meet their inherent challenges: massive multi-level parallelism, topology, asynchronicity, and abstraction. The "General, Hybrid, and Optimized Sparse Toolkit" (GHOST) is a collection of building blocks that targets algorithms dealing with sparse matrix representations on current and future large-scale systems. It implements the "MPI+X" paradigm, has a pure C interface, and provides hybrid-parallel numerical kernels, intelligent resource management, and truly heterogeneous parallelism for multicore CPUs, Nvidia GPUs, and the Intel Xeon Phi. We describe the details of its design with respect to the challenges posed by modern heterogeneous hardware.
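GHOST's actual interface is not shown here; purely as an illustration of the "MPI+X" execution model the abstract refers to, the following minimal C sketch starts one MPI process per socket or accelerator and spawns OpenMP threads inside each process, which is where a thread-parallel (or device-offloaded) local spMVM would run.

#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int provided, rank, nprocs;
    /* Request thread support so OpenMP threads may call MPI if needed. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* Each process owns a block of matrix rows; halo elements of the input
     * vector would be exchanged between processes before the local spMVM. */
    #pragma omp parallel
    {
        #pragma omp master
        printf("rank %d of %d: %d threads\n",
               rank, nprocs, omp_get_num_threads());
        /* ... thread-parallel local spMVM on this process's rows ... */
    }

    MPI_Finalize();
    return 0;
}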
In order to efficiently use future generations of supercomputers, fault tolerance and power consumption are two of the prime challenges anticipated by the High Performance Computing (HPC) community. Checkpoint/Restart (CR) has been and still is the most widely used technique to deal with hard failures. Application-level CR is the most effective CR technique in terms of overhead efficiency, but it requires considerable implementation effort. This work presents the implementation of our C++ based library CRAFT (Checkpoint-Restart and Automatic Fault Tolerance), which serves two purposes. First, it provides an extendable library that significantly eases the implementation of application-level checkpointing. The most basic and frequently used checkpoint data types are already part of CRAFT and can be used directly out of the box; the library can easily be extended to support further data types. As a means of overhead reduction, the library offers a built-in asynchronous checkpointing mechanism and also supports the Scalable Checkpoint/Restart (SCR) library for node-level checkpointing. Second, CRAFT provides an easier interface for User-Level Failure Mitigation (ULFM) based dynamic process recovery, which significantly reduces the complexity and effort of failure detection and communication recovery mechanisms. By using both functionalities together, applications can write application-level checkpoints and recover dynamically from process failures with very limited programming effort. This work presents the design and use of our library in detail. The associated overheads are thoroughly analyzed using several benchmarks.

Thomas Zeiser holds a PhD in Computational Fluid Mechanics from the University of Erlangen-Nuremberg. He is now a senior research scientist in the HPC group of RRZE and is, among many other things, still interested in lattice Boltzmann methods.

Georg Hager holds a PhD in Computational Physics from the University of Greifswald. He has been working with high performance systems since 1995 and is now a senior research scientist in the HPC group at RRZE. Recent research includes architecture-specific optimization for current microprocessors, performance modeling on the processor and system levels, and the efficient use of hybrid parallel systems. His daily work encompasses all aspects of user support in HPC such as lectures, tutorials, training, code parallelization, profiling and optimization, and the assessment of novel computer architectures and tools.

Gerhard Wellein holds a PhD in Solid State Physics from the University of Bayreuth and is a professor at the Department of Computer Science at the University of Erlangen. He heads the HPC group at RRZE and has more than 10 years of experience in teaching HPC techniques to students and scientists from Computational Science and Engineering. His research interests include solving large sparse eigenvalue problems, novel parallelization approaches, performance modeling, and architecture-specific optimization.
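To make the application-level checkpoint/restart pattern from the CRAFT abstract above concrete, here is a generic sketch in plain C. CRAFT itself is a C++ library with a different interface; read_checkpoint() and write_checkpoint() are hypothetical helpers that simply serialize the iteration counter and the solution vector to a file.

#include <stdio.h>
#include <stdlib.h>

/* Returns 1 if a checkpoint was read, 0 otherwise. */
static int read_checkpoint(const char *path, int *iter, double *x, int n)
{
    FILE *f = fopen(path, "rb");
    if (!f) return 0;                                  /* no checkpoint yet */
    int ok = fread(iter, sizeof(int), 1, f) == 1 &&
             fread(x, sizeof(double), n, f) == (size_t)n;
    fclose(f);
    return ok;
}

static void write_checkpoint(const char *path, int iter, const double *x, int n)
{
    FILE *f = fopen(path, "wb");
    if (!f) return;
    fwrite(&iter, sizeof(int), 1, f);
    fwrite(x, sizeof(double), n, f);
    fclose(f);                              /* fsync/atomic rename omitted */
}

int main(void)
{
    enum { N = 1000, MAXITER = 10000, CP_FREQ = 100 };
    double *x = calloc(N, sizeof(double));
    int start = 0;

    /* On (re)start, resume from the last checkpoint if one exists. */
    read_checkpoint("app.ckpt", &start, x, N);

    for (int iter = start; iter < MAXITER; ++iter) {
        /* ... one iteration of the solver updating x ... */
        if (iter % CP_FREQ == 0)
            write_checkpoint("app.ckpt", iter, x, N);
    }
    free(x);
    return 0;
}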
The Kernel Polynomial Method (KPM) is a well-established scheme in quantum physics and quantum chemistry to determine the eigenvalue density and spectral properties of large sparse matrices. In this work we demonstrate the high optimization potential and feasibility of petascale heterogeneous CPU-GPU implementations of the KPM. At the node level we show that it is possible to decouple the sparse matrix problem posed by KPM from main memory bandwidth both on CPU and GPU. To alleviate the effects of scattered data access we combine loosely coupled outer iterations with tightly coupled block sparse matrix multiple vector operations, which enables pure data streaming. All optimizations are guided by a performance analysis and modelling process that indicates how the computational bottlenecks change with each optimization step. Finally, we use the optimized node-level KPM with a hybrid-parallel framework to perform large-scale heterogeneous electronic structure calculations for novel topological materials on a petascale-class Cray XC30 system.

Keywords: Parallel programming, Quantum mechanics, Performance analysis, Sparse matrices

It is widely accepted that future supercomputer architectures will change considerably compared to the machines used at present for large-scale simulations. Extreme parallelism, the use of heterogeneous compute devices, and a steady decrease in the architectural balance in terms of main memory bandwidth vs. peak performance are important factors to consider when developing and implementing sustainable code structures. Accelerator-based systems already account for a performance share of 34% of the total TOP500 [1] today, and they may provide first blueprints of future architectural developments. The heterogeneous hardware structure typically calls for completely new software development, in particular if the simultaneous use of all compute devices is addressed to maximize performance and energy efficiency. A prominent example demonstrating the need for new software implementations and structures is the MAGMA project [2]. In dense linear algebra the code balance (bytes/flop) of basic operations can often be reduced by blocking techniques to better match the machine balance; thus, this community is expected to achieve high absolute performance also on future supercomputers. In contrast, sparse linear algebra is known for low sustained performance on state-of-the-art homogeneous systems. The sparse matrix-vector multiplication (SpMV) is often the performance-critical step. Most of the broad research on optimal SpMV data structures has been devoted to driving the balance of a general SpMV (not exploiting any special matrix properties) down to its minimum value of 6 bytes/flop (double precision) or 2.5 bytes/flop (double complex) on all architectures, which is still at least an order of magnitude away from current machine balance numbers. Just recently, the long-known idea of applying the sparse matrix to multiple vectors at the same time (SpMMV; see, e.g., [3]) to reduce the computational balance has gained ...
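The SpMMV idea mentioned at the end of the excerpt above can be illustrated with a simple CRS-based sketch in C (not the paper's optimized kernel): each matrix entry is loaded once and applied to all nv right-hand-side vectors, which are stored row-major so that the block of vectors stays cache-resident while the matrix data streams through memory only once, lowering the code balance compared to nv separate SpMVs.

/* Y = A*X for a CRS matrix applied to a block of nv vectors.
 * X and Y are nrows x nv arrays, row-major (vector index fastest). */
void spmmv_crs(int nrows, int nv,
               const int *rowptr, const int *col, const double *val,
               const double *X, double *Y)
{
    for (int i = 0; i < nrows; ++i) {
        for (int v = 0; v < nv; ++v)
            Y[i * nv + v] = 0.0;
        for (int j = rowptr[i]; j < rowptr[i + 1]; ++j) {
            double a = val[j];            /* matrix entry loaded once ...   */
            int    c = col[j];
            for (int v = 0; v < nv; ++v)  /* ... and used for all nv vectors */
                Y[i * nv + v] += a * X[c * nv + v];
        }
    }
}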