SIAM Conference on Applied and Computational Discrete Algorithms (ACDA21) 2021
DOI: 10.1137/1.9781611976830.14

A Message-Driven, Multi-GPU Parallel Sparse Triangular Solver

Abstract: Sparse triangular solve (SpTRSV) is used in conjunction with sparse LU factorization for solving sparse linear systems, either as a direct solver or as a preconditioner. As GPUs have become a first-class compute citizen, designing an efficient and scalable SpTRSV on multi-GPU HPC systems is imperative. In this paper, we leverage the advantage of GPU-initiated data transfers of NVSHMEM to implement and evaluate a multi-GPU SpTRSV. We create a novel producer-consumer paradigm to manage the computation and communication in SpTRSV and im…
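
A minimal sketch of the kind of GPU-initiated, producer-consumer exchange NVSHMEM enables (an illustration of the general technique, not the paper's actual implementation): the producer PE pushes a solved entry into the consumer PE's symmetric buffer with a device-side put, orders it with a fence, and raises a flag on which the consumer spin-waits. The single-value payload, fixed PE roles, and buffer names are assumptions made for brevity.

```cuda
// Sketch only: one producer PE (0) hands one solved value to one consumer PE (1)
// using GPU-initiated NVSHMEM puts and a flag. Real multi-GPU SpTRSV schedules
// many such dependencies along the elimination DAG.
#include <cstdio>
#include <cuda_runtime.h>
#include <nvshmem.h>
#include <nvshmemx.h>

__global__ void producer(double *x_remote, int *ready_remote, int peer) {
    double solved = 42.0;                      // placeholder for a solved unknown x_i
    nvshmem_double_p(x_remote, solved, peer);  // one-sided put initiated from the GPU
    nvshmem_fence();                           // order the data put before the flag
    nvshmem_int_p(ready_remote, 1, peer);      // signal that the value has arrived
}

__global__ void consumer(const double *x, int *ready) {
    nvshmem_int_wait_until(ready, NVSHMEM_CMP_EQ, 1);  // spin until producer signals
    printf("consumer received x = %f\n", *x);
}

int main() {
    nvshmem_init();                            // per-PE device selection omitted for brevity
    int pe = nvshmem_my_pe();

    // Symmetric (NVSHMEM-visible) allocations, one per PE.
    double *x  = (double *) nvshmem_malloc(sizeof(double));
    int *ready = (int *)    nvshmem_malloc(sizeof(int));
    cudaMemset(ready, 0, sizeof(int));
    nvshmem_barrier_all();

    // Kernels containing NVSHMEM synchronization are normally launched with
    // nvshmemx_collective_launch; plain launches keep this sketch short.
    if (pe == 0)      producer<<<1, 1>>>(x, ready, 1);
    else if (pe == 1) consumer<<<1, 1>>>(x, ready);
    cudaDeviceSynchronize();

    nvshmem_free(x);
    nvshmem_free(ready);
    nvshmem_finalize();
    return 0;
}
```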

Cited by 5 publications (5 citation statements) · References 32 publications
“…SuperLU: For the multi-GPU sparse triangular solve (SpTRSV), we leverage the advantage of GPU-initiated data transfers of NVSHMEM. The new multi-GPU SpTRSV implementation using two CUDA streams achieves a 3.7× speedup when using twelve GPUs (two nodes of the Summit supercomputer at ORNL) relative to our implementation on a single GPU, and up to 6.1× compared to NVIDIA's cuSOLVER (cuSPARSE csrsv2) over the range of one to eighteen GPUs [164]. In the new v7.0.0 release of SuperLU_DIST, we released the 3D factorization algorithm, where MPI ranks are arranged as a 3D process grid Px × Py × Pz.…”
Section: Recent Progress (mentioning)
confidence: 87%
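A small host-side sketch of the Px × Py × Pz arrangement mentioned in the quote, using MPI's Cartesian-topology helpers; the 2 × 3 × 2 shape is an illustrative assumption, and this is not SuperLU_DIST's actual setup code.

```c++
// Sketch: arrange MPI ranks as a Px x Py x Pz 3D process grid and recover each
// rank's (px, py, pz) coordinates. Requires Px*Py*Pz ranks (12 with this shape).
#include <cstdio>
#include <mpi.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int dims[3]    = {2, 3, 2};   // Px, Py, Pz (illustrative)
    int periods[3] = {0, 0, 0};   // non-periodic in every direction
    MPI_Comm grid;
    MPI_Cart_create(MPI_COMM_WORLD, 3, dims, periods, /*reorder=*/1, &grid);

    int rank, coords[3];
    MPI_Comm_rank(grid, &rank);
    MPI_Cart_coords(grid, rank, 3, coords);   // flat rank -> (px, py, pz)
    printf("rank %d -> (px=%d, py=%d, pz=%d)\n",
           rank, coords[0], coords[1], coords[2]);

    MPI_Comm_free(&grid);
    MPI_Finalize();
    return 0;
}
```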
“…They can sometimes scale to thousands and even millions of cores [57-59]. They can distribute the memory footprint, and can fit in the 512 GB of HBM3 in a 2026 disaggregated system, which is larger than the node-local DDR provided today (256 GB of DDR on a 2021 HPC system). We use GEMM [60] and STREAM [61] as two representative benchmarks to show the implications as the data size grows.…”
Section: Traditional HPC Workload Bookends (mentioning)
confidence: 99%
“…Traditional HPC workloads are designed for distributed-memory systems. They can sometimes scale to thousands and even millions of cores [57-59]. They can distribute the memory footprint, and can fit in the 512 GB of HBM3 in a 2026 disaggregated system, which is larger than the node-local DDR provided today (256 GB of DDR on a 2021 HPC system).…”
Section: Application Case Studies (mentioning)
confidence: 99%
“…Traditional HPC workloads are designed for distributed-memory systems. They can sometimes scale to thousands and even millions of cores [39-41]. They can distribute the memory footprint, and can fit in the 512 GB of HBM3 in a 2026 disaggregated system, which is larger than the node-local DDR provided today (256 GB of DDR on a 2021 HPC system).…”
Section: B. Application Characteristics (mentioning)
confidence: 99%