ALT: Breaking the Wall between Data Layout and Loop Optimizations for Deep Learning Compilation

Xu, Zhiying; Xu, Jiafan; Peng, Hongding; Wang, Wei; Wang, Xiaoliang; Wan, Haoran; Dai, Haipeng; Xu, Yixu; Cheng, Hao; Wang, Kun; Chen, Guihai

doi:10.1145/3552326.3587440

Cited by 3 publications

(1 citation statement)

References 34 publications

Supporting

Mentioning

Contrasting

Order By: Relevance

“…One such instance arises in the context of Automated Machine Learning (AutoML), where the network architecture undergoes continuous evolution in pursuit of the optimal configuration tailored to a specific input dataset [13], [64]. In that sense, the autotuning of computational kernels has garnered significant interest, offering various strategies to optimize GEMM kernels based on the specific characteristics of the input problem and hardware architecture [10], [30], [59], [66], [71].…”

Section: Introductionmentioning

confidence: 99%

STuning-DL: Model-Driven Autotuning of Sparse GPU Kernels for Deep Learning

Castro,

Andrade,

Fraguela

2024

IEEE Access

View full text Add to dashboard Cite

The relentless growth of modern Machine Learning models has spurred the adoption of sparsification techniques to simplify their architectures and reduce the computational demands. Network pruning has demonstrated success in maintaining original network accuracy while shedding significant portions of the original weights. However, leveraging this sparsity efficiently remains challenging due to computational irregularities, particularly in GPU kernels. A new trend of template-based GPU kernels for semi-structured sparsity shows promise in efficiency but lacks autotuning capabilities to adapt to input dynamics, often underperforming in scenarios where they have not been meticulously hand-tuned. We present STuning-DL, the first pruning-aware autotuner for third-party template-based implementations enabling efficient optimization of sparse kernels for Deep Learning, spanning from high-level aspects (CUDA C++ level) down to GPU-native instructions specifics (assembly-level). STuning-DL tunes and optimizes at run-time sparse kernels' performance for each input problem, yielding speedups of up to 5.42× on NVIDIA T4-16GB and up to 3.6× on NVIDIA A100-40GB GPU in sparse matrices from real world models compared to existing heuristics from sparse libraries like cuSparse and cuSparseLt.

show abstract

Section: Introductionmentioning

confidence: 99%

STuning-DL: Model-Driven Autotuning of Sparse GPU Kernels for Deep Learning

Castro,

Andrade,

Fraguela

2024

IEEE Access

View full text Add to dashboard Cite

show abstract

Detecting Numerical Deviations in Deep Learning Models Introduced by the TVM Compiler

Xia,

Chen,

Nie

et al. 2024

2024 IEEE 35th International Symposium on Software Reliability Engineering (ISSRE)

View full text Add to dashboard Cite

Cross-Feature Transfer Learning for Efficient Tensor Program Generation

Verma,

Raskar,

Emani

et al. 2024

Applied Sciences

View full text Add to dashboard Cite

Tuning tensor program generation involves navigating a vast search space to find optimal program transformations and measurements for a program on the target hardware. The complexity of this process is further amplified by the exponential combinations of transformations, especially in heterogeneous environments. This research addresses these challenges by introducing a novel approach that learns the joint neural network and hardware features space, facilitating knowledge transfer to new, unseen target hardware. A comprehensive analysis is conducted on the existing state-of-the-art dataset, TenSet, including a thorough examination of test split strategies and the proposal of methodologies for dataset pruning. Leveraging an attention-inspired technique, we tailor the tuning of tensor programs to embed both neural network and hardware-specific features. Notably, our approach substantially reduces the dataset size by up to 53% compared to the baseline without compromising Pairwise Comparison Accuracy (PCA). Furthermore, our proposed methodology demonstrates competitive or improved mean inference times with only 25–40% of the baseline tuning time across various networks and target hardware. The attention-based tuner can effectively utilize schedules learned from previous hardware program measurements to optimize tensor program tuning on previously unseen hardware, achieving a top-5 accuracy exceeding 90%. This research introduces a significant advancement in autotuning tensor program generation, addressing the complexities associated with heterogeneous environments and showcasing promising results regarding efficiency and accuracy.

show abstract

ALT: Breaking the Wall between Data Layout and Loop Optimizations for Deep Learning Compilation

Cited by 3 publications

References 34 publications

STuning-DL: Model-Driven Autotuning of Sparse GPU Kernels for Deep Learning

STuning-DL: Model-Driven Autotuning of Sparse GPU Kernels for Deep Learning

Detecting Numerical Deviations in Deep Learning Models Introduced by the TVM Compiler

Cross-Feature Transfer Learning for Efficient Tensor Program Generation

Contact Info

Product

Resources

About