Proceedings of the International Conference on Parallel Architectures and Compilation Techniques 2022
DOI: 10.1145/3559009.3569691

Probing the Efficacy of Hardware-Aware Weight Pruning to Optimize the SpMM Routine on Ampere GPUs

Cited by 6 publications (7 citation statements)
References 18 publications
“…Vector-wise pruning can accelerate sparse routines on GPUs. However, if the vector length is greater than 8, it can significantly reduce the accuracy [4,5,25]. The results demonstrate that the V:N:M format occupies an intermediate position between unstructured and vector-wise pruning.…”
Section: Energy Evaluation of V:N:M
confidence: 95%
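To make the pruning granularities in this statement concrete, the following minimal NumPy sketch (an illustrative assumption, not code from the cited paper or any of the referenced libraries) applies magnitude-based N:M pruning: the N largest-magnitude weights survive in every group of M consecutive weights along a row, with the 2:4 case matching the pattern accepted by Ampere sparse tensor cores. The V:N:M format discussed in the quote adds a vector-wise selection on top of this per-group rule, which is why it sits between unstructured and vector-wise pruning.

# Hypothetical sketch: magnitude-based N:M pruning of a weight matrix.
import numpy as np

def prune_n_m(weights, n=2, m=4):
    """Keep the n largest-magnitude entries in each group of m consecutive
    weights along the last axis; zero the rest."""
    rows, cols = weights.shape
    assert cols % m == 0, "column count must be a multiple of m"
    groups = weights.reshape(rows, cols // m, m)
    # Indices of the (m - n) smallest-magnitude entries in each group.
    drop = np.argsort(np.abs(groups), axis=-1)[..., : m - n]
    mask = np.ones_like(groups, dtype=bool)
    np.put_along_axis(mask, drop, False, axis=-1)
    return (groups * mask).reshape(rows, cols)

W = np.random.randn(8, 16).astype(np.float32)
W_24 = prune_n_m(W, n=2, m=4)   # 50% sparsity in the 2:4 pattern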
“…In this context, cuSparseLt SpMM implementation is the reference library to exploit the 2:4 format on SPTCs. Since there are no SpMM GPU implementations for arbitrary N:M sparsity levels, we have considered in the evaluation the following third-party libraries that support half-precision: Sputnik [11], and CLASP [4] which extends vectorSparse [5] to the latest generations of NVIDIA GPU architectures. While [11] has been designed for non-structured sparse matrices, [4] is focused on semi-structured sparse input matrices following the column-vector sparse format, which supports vector lengths l = 2, 4 and 8.…”
Section: Comparison with Existing Dense and Sparse Libraries
confidence: 99%
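As a concrete illustration of the column-vector sparse format mentioned above, the sketch below (a hypothetical NumPy example under assumed scoring rules, not the actual CLASP or vectorSparse code) prunes weights in vertical vectors of length l, scoring each l x 1 column vector by its L2 norm and keeping only the strongest fraction, so that the surviving non-zeros stay aligned in vectors of the supported lengths l = 2, 4 or 8.

# Hypothetical sketch: vector-wise (column-vector) pruning with vector length l.
import numpy as np

def prune_column_vectors(weights, l=8, keep_ratio=0.5):
    """Zero out whole l x 1 column vectors, keeping roughly `keep_ratio`
    of the vectors with the largest L2 norm."""
    rows, cols = weights.shape
    assert rows % l == 0, "row count must be a multiple of the vector length"
    blocks = weights.reshape(rows // l, l, cols)      # (row groups, l, cols)
    scores = np.linalg.norm(blocks, axis=1)           # one score per l x 1 vector
    k = int(keep_ratio * scores.size)
    threshold = np.sort(scores, axis=None)[::-1][k - 1]
    mask = (scores >= threshold)[:, None, :]          # broadcast over the l dimension
    return (blocks * mask).reshape(rows, cols)

W = np.random.randn(64, 64).astype(np.float32)
W_cv = prune_column_vectors(W, l=8, keep_ratio=0.5)

Larger l makes the surviving non-zeros more regular and easier to map onto GPU tiles, which is exactly the accuracy-versus-regularity trade-off the quoted statement attributes to vector lengths greater than 8.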
“…Sparse input matrices are inherently shaped by the pruning algorithms, which can generate highly irregular sparse matrices [16]. This irregularity, however, can significantly undermine performance due to inefficient hardware utilization [4]. Therefore, a new trend of semi-structured pruning techniques, which aims to find trade-offs between performance and accuracy, can yield quite structured patterns that offer better performance, but little to no room for tuning their representation [29].…”
Section: Introduction
confidence: 99%
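For contrast with the semi-structured schemes above, here is a minimal, hypothetical sketch (not tied to any cited implementation) of unstructured global magnitude pruning; the uneven per-row non-zero counts it produces are the kind of irregularity that the quoted statement links to inefficient hardware utilization.

# Hypothetical sketch: unstructured global magnitude pruning.
import numpy as np

def prune_unstructured(weights, sparsity=0.9):
    """Zero the smallest-magnitude entries until `sparsity` of them are gone."""
    flat = np.abs(weights).ravel()
    k = int(sparsity * flat.size)
    threshold = np.partition(flat, k)[k]
    return np.where(np.abs(weights) >= threshold, weights, 0.0)

W = np.random.randn(256, 256).astype(np.float32)
W_sparse = prune_unstructured(W, sparsity=0.9)
nnz_per_row = np.count_nonzero(W_sparse, axis=1)
print(nnz_per_row.min(), nnz_per_row.max())  # rows receive uneven amounts of work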
“…Problem 2: Poor generalization to new input problems and other platforms. The daunting task of crafting efficient kernels for sparse computation in DL has spurred the proliferation of specialized kernels tailored to address specific input problem shapes and hardware architectures [4], [7], [16]. This limitation recognizes the inherent difficulty of preserving the performance across all conceivable scenarios.…”
Section: Introduction
confidence: 99%