Kokkos Kernels: Performance Portable Sparse/Dense Linear Algebra and Graph Kernels

Rajamanickam, Sivasankaran; Acer, Seher; Berger-Vergiat, Luc; Dang, Vinh Quang; Ellingwood, Nathan David; Harvey, Evan; Kelley, Brian P.; Trott, Christian Robert; Wilke, Jeremiah J; Yamazaki, Ichitaro

doi:10.48550/arxiv.2103.11991

Cited by 3 publications

(4 citation statements)

References 22 publications


“…One can solve these dense linear systems effectively using modern manycore CPUs and GPUs [5], in fact this area of research has been the focus of the community for past several years. Libraries such as Kokkos Kernels [6], MAGMA [7], cuSOLVER provide implementations of such solvers. However, several recent formulations result in these small systems themselves being sparse.…”

Section: Introduction

mentioning

confidence: 99%


Performance Portable Batched Sparse Linear Solvers

Liegeois, Rajamanickam, Berger-Vergiat (2023), IEEE Trans. Parallel Distrib. Syst.

Solving a large number of small linear systems is increasingly becoming a bottleneck in computational science applications. While dense linear solvers for such systems have been studied before, batched sparse linear solvers are just starting to emerge. In this paper, we discuss algorithms for solving batched sparse linear systems and their implementation in the Kokkos Kernels library. The new algorithms are performance portable and map well to the hierarchical parallelism available in modern accelerator architectures. The sparse matrix-vector product (SPMV) kernel is the main performance bottleneck of the Krylov solvers we implement in this work. The implementation of the batched SPMV and its performance are therefore discussed thoroughly in this paper. The implemented kernels are tested on different Central Processing Unit (CPU) and Graphics Processing Unit (GPU) architectures. We also develop batched Conjugate Gradient (CG) and batched Generalized Minimum Residual (GMRES) solvers using the batched SPMV. Our proposed solver was able to solve 20,000 sparse linear systems on V100 GPUs with mean speedups of 76x and 924x compared, respectively, to a parallel sparse solver applied to a block diagonal system containing all the small linear systems, and to solving the small systems one at a time. We see a mean speedup of 0.51 compared to the dense batched solver of cuSOLVER on V100, while using far less memory. A thorough performance evaluation on three different architectures and an analysis of the performance are presented.
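As a rough illustration of the batched SPMV kernel the abstract refers to, here is a minimal pure-Python sketch. It is not the Kokkos Kernels implementation or its API. It assumes that all matrices in the batch share one CSR sparsity pattern and differ only in their values (the abstract does not state this), and the names `batched_spmv`, `row_ptr`, `col_idx`, and `values` are hypothetical.

```python
def batched_spmv(row_ptr, col_idx, values, x):
    """Compute y[b] = A[b] @ x[b] for every matrix b in a batch.

    Assumption (for illustration only): all matrices share one CSR
    sparsity pattern (row_ptr, col_idx); values[b][k] is the k-th
    stored nonzero of matrix b, and x[b] is its input vector.
    """
    n_rows = len(row_ptr) - 1
    y = []
    # Outer loop over systems: on an accelerator this level would map to
    # independent groups of threads, one group per small system.
    for vals_b, x_b in zip(values, x):
        y_b = [0.0] * n_rows
        # Inner loops over rows and nonzeros: the finer level of parallelism.
        for i in range(n_rows):
            acc = 0.0
            for k in range(row_ptr[i], row_ptr[i + 1]):
                acc += vals_b[k] * x_b[col_idx[k]]
            y_b[i] = acc
        y.append(y_b)
    return y


# Two 3x3 tridiagonal matrices with the same pattern but different values.
row_ptr = [0, 2, 5, 7]
col_idx = [0, 1, 0, 1, 2, 1, 2]
values = [
    [2.0, -1.0, -1.0, 2.0, -1.0, -1.0, 2.0],
    [4.0, 1.0, 1.0, 4.0, 1.0, 1.0, 4.0],
]
x = [[1.0, 1.0, 1.0], [1.0, 2.0, 3.0]]
y = batched_spmv(row_ptr, col_idx, values, x)
```

Storing the pattern once and only the values per system is one way such a batch can use less memory than dense storage. The two loop levels mirror the hierarchical parallelism the abstract mentions.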



“…• A performance portable implementation of these solvers using the Kokkos library made available publicly in the Kokkos Kernels library [6].…”

Section: Introduction

mentioning

confidence: 99%

Performance Portable Batched Sparse Linear Solvers

Liegeois, Rajamanickam, Berger-Vergiat (2023), IEEE Trans. Parallel Distrib. Syst.

“…Based on [26] that also uses the DAG layer partitioning, but instead of using global barriers, specialized point-to-point sparsified barriers are used for only the threads involved in dependencies, avoiding unnecessary stalls of the rest of the threads. Performance of the sparse kernels from the portable Kokkos library V3.4.1 [27] is benchmarked.…”

mentioning

confidence: 99%

GraphOpt: Constrained-Optimization-Based Parallelization of Irregular Graphs

Shah, Meert, Verhelst (2022), IEEE Trans. Parallel Distrib. Syst.

Sparse, irregular graphs show up in various applications like linear algebra, machine learning, engineering simulations, robotic control, etc. These graphs have a high degree of parallelism, but their execution on parallel threads of modern platforms remains challenging due to the irregular data dependencies. The execution performance can be improved by efficiently partitioning the graphs such that the communication and thread synchronization overheads are minimized without hurting the utilization of the threads. To achieve this, this paper proposes GRAPHOPT, a tool that models the graph parallelization as a constrained optimization problem and uses the open Google OR-Tools solver to find good partitions. Several scalability techniques are developed to handle large real-world graphs with millions of nodes and edges. Extensive experiments are performed on the graphs of sparse matrix triangular solves (linear algebra) and sum-product networks (machine learning), respectively, showing a mean speedup of 2.0× and 1.8× over previous state-of-the-art libraries, demonstrating the effectiveness of the constrained-optimization-based graph parallelization.


“…In the wake of the rising importance of graph-based computations, the hardware landscape within the compute industry began to undergo key shifts. Traditional Central Processing Units (CPUs), initially designed for sequential tasks, started incorporating SIMD-based graph extensions to enhance parallel processing capabilities [215]. Graphics Processing Units (GPUs), with their inherent parallelism, were enhanced with kernel support tailored specifically for graph algorithms [148,174]. Beyond these general-purpose processors, the industry also witnessed the advent of domain-specific accelerators [86,115,153,202], specifically crafted to speed up graph computations, addressing the unique challenges and demands that graph algorithms present.…”

Section: Background and Motivation

mentioning

confidence: 99%

Enabling accelerators for graph computing

Shivdikar
