50th International Conference on Parallel Processing 2021
DOI: 10.1145/3472456.3473517
Automatic Code Generation and Optimization of Large-scale Stencil Computation on Many-core Processors

Cited by 17 publications (3 citation statements)
References 29 publications
“…Application-specific performance models [32,41,60] introduce domain knowledge into the prediction and often use the generated communication or performance model to inform an optimization search without executing the program, which might be expensive due to running on distributed environments.…”
Section: Related Work
confidence: 99%
“…Existing open-source RLHF frameworks such as Transformer Reinforcement Learning (TRL), Colos-salChat (CAIChat), and DeepSpeed-Chat (DSChat) rely on parallelization approaches like Zero Redundancy Optimizer (ZeRO) to co-locate the four models involved in RLHF training on the same GPU [14,28,20]. However, as models continue to grow past 70 billion parameters, this scheduling approach becomes increasingly inefficient with limited GPU memory.…”
Section: Introduction
confidence: 99%
“…For a given dimension, the distance between the center point and its farthest neighbor is denoted as the radius of the stencil. Due to its intrinsic nature, stencil computation often suffers from low memory bandwidth and poor locality [40] on modern processors, which makes it notorious for performance optimization [18,20,44].…”
confidence: 99%
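To make the radius definition and the bandwidth problem quoted above concrete, here is a minimal illustrative sketch (not taken from the cited paper) of a radius-1, one-dimensional Jacobi-style stencil: each interior point is replaced by the average of itself and its two nearest neighbors, so the farthest neighbor lies at distance 1 from the center point.

```python
def jacobi_1d(u, steps):
    """Radius-1 1-D stencil: average each interior point with its neighbors.

    Illustrative sketch only; real stencil codes operate on large
    multi-dimensional arrays where this access pattern becomes
    memory-bandwidth bound.
    """
    u = list(u)
    for _ in range(steps):
        new = list(u)
        for i in range(1, len(u) - 1):
            # Each update reads 3 values but performs only a few flops,
            # i.e. low arithmetic intensity -- the root of the poor
            # bandwidth utilization and locality noted in the quote.
            new[i] = (u[i - 1] + u[i] + u[i + 1]) / 3.0
        u = new
    return u

print(jacobi_1d([0.0, 0.0, 3.0, 0.0, 0.0], 1))
```

One sweep spreads the central value to its radius-1 neighborhood, which is exactly the data-reuse pattern that tiling and other locality optimizations for stencils try to exploit.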