49th International Conference on Parallel Processing - ICPP 2020
DOI: 10.1145/3404397.3404413

Efficient Block Algorithms for Parallel Sparse Triangular Solve

Cited by 21 publications (7 citation statements)
References 67 publications
“…Ultimately, for each GPU the accumulated communication time of each DAG level is the final communication time. The communication time on each level is estimated using the number of non-overlapped messages in GPU p (lines 30-33). The row reduction follows the same manner.…”
Section: SpTRSV Performance Model for GPUs
confidence: 99%
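The excerpt above describes accumulating a per-level communication estimate into a final per-GPU time. A minimal sketch of such a model, assuming a simple latency-plus-bandwidth (alpha-beta) cost per message (the function and parameter names here are illustrative, not from the cited work):

```python
def estimate_comm_time(msgs_per_level, latency, bytes_per_msg, bandwidth):
    """Accumulate communication time over DAG levels for one GPU.

    msgs_per_level: non-overlapped message counts, one entry per DAG level.
    latency, bandwidth: assumed per-message latency (s) and link bandwidth (B/s).
    """
    total = 0.0
    for n_msgs in msgs_per_level:
        # Each level's cost is driven by its non-overlapped message count;
        # the final time is the sum over all levels.
        total += n_msgs * (latency + bytes_per_msg / bandwidth)
    return total
```

The same accumulation structure would apply to the row reduction mentioned in the excerpt, with its own message counts.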
“…Exploring high performance SpTRSV is becoming ever more crucial on GPU-accelerated architectures. Most existing parallel GPU triangular solvers focus on optimizing single-GPU performance [6, 30-33]. Due to the complex data dependencies in SpTRSV, algorithm optimization has been mainly based on the level-set methods and color-set methods for various parallel architectures.…”
Section: Related Work
confidence: 99%
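The level-set method mentioned above groups unknowns of the triangular system into levels such that all unknowns in one level are mutually independent and can be solved in parallel. A minimal sketch of the standard level-set construction (the input format here is an assumption for illustration):

```python
def build_level_sets(rows):
    """Compute level sets of the SpTRSV dependency DAG.

    rows[i] lists the column indices j < i with a nonzero L[i][j],
    i.e., the unknowns that x[i] depends on.
    """
    n = len(rows)
    level = [0] * n
    for i in range(n):
        # x[i] must wait for the deepest of its dependencies;
        # an unknown with no dependencies sits in level 0.
        level[i] = 1 + max((level[j] for j in rows[i]), default=-1)
    sets = [[] for _ in range(max(level) + 1)] if n else []
    for i, lv in enumerate(level):
        sets[lv].append(i)
    return sets
```

Levels are then processed one after another with a synchronization between them, which is exactly where the level-count and per-level parallelism trade-off of these methods comes from.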
“…Then we pass the remote access to the node to reduce the number of interconnect communications. Then, for the solve-update phase, we use a similar method to collect all the system-wide left.sum values to solve the component x and update the intermediate data locally for its dependents using the hybrid memory system (lines 28-35). Note that this method still employs device-wide atomic operations to update the intermediate value, as multiple updates from different warps of one PE may happen simultaneously.…”
Section: B. SpTRSV Design with NVSHMEM
confidence: 99%
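The solve-update pattern described above solves each component once its accumulated contributions (the left.sum values) have arrived, then pushes its contribution to dependents; on the GPU those pushes are the atomic updates mentioned in the excerpt. A serial sketch of the same pattern, assuming a push-style column representation (the names `cols`, `left_sum` are illustrative):

```python
def sptrsv_push(cols, diag, b):
    """Serial solve-update sketch for a lower-triangular system L x = b.

    cols[i]: list of (k, L_ki) pairs, the dependents k > i with nonzero L[k][i].
    diag[i]: the diagonal entry L[i][i].
    """
    n = len(b)
    left_sum = [0.0] * n  # accumulated contributions, updated atomically on GPU
    x = [0.0] * n
    for i in range(n):
        # Solve phase: all of x[i]'s contributions are already in left_sum[i].
        x[i] = (b[i] - left_sum[i]) / diag[i]
        # Update phase: push x[i]'s contribution to each dependent row k.
        for k, lki in cols[i]:
            left_sum[k] += lki * x[i]
    return x
```

In a parallel setting, multiple producers may update the same `left_sum[k]` concurrently, which is why the citing work keeps device-wide atomics for this step.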
“…Concurrent data structures are fundamental building blocks for real-world applications. Existing works have proposed various novel data structures to handle the dependencies inside SpTRSV [4], [6]-[10], [34]. For better reuse of the right-hand sides on the Sunway architecture, Wang et al. [4] tile the sparse matrix to control the data flow and explore inter-level parallelism for SpTRSV.…”
Section: Related Work
confidence: 99%