Proceedings of the 42nd ACM SIGPLAN International Conference on Programming Language Design and Implementation (PLDI 2021)
DOI: 10.1145/3453483.3454106

AKG: automatic kernel generation for neural processing units using polyhedral transformations

Cited by 43 publications (20 citation statements) · References 72 publications
“…Those passes have complex optimization rules for different domain-specific code structures (e.g., big loops, large buffer allocation, and thread scheduling) that general-purpose mutators can hardly target. Hence, according to the hot spot program patterns targeted by existing tensor compilers [Chen et al. 2018; Ragan-Kelley et al. 2013; Tillet et al. 2019; Zhao et al. 2021], Tzer specifically designed 3 types of mutators: 1) loop-nesting mutator for creating multifarious dense loop structures; 2) memory-operation mutator for various memory allocation/store/load patterns at the index level; and 3) thread-binding mutator for diversifying the parallel computation flows to generate interesting code patterns that tensor compilers particularly care about. Loop Nesting.…”
Section: Domain-specific Mutation
confidence: 99%
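To make the mutator types in the excerpt above concrete, here is a minimal Python sketch of a loop-nesting mutator over a toy loop IR. The Loop dataclass and the mutate_loop_nesting helper are hypothetical illustrations of the technique, not Tzer's actual implementation or IR.

```python
import random
from dataclasses import dataclass

@dataclass
class Loop:
    var: str
    extent: int
    body: list  # nested Loop nodes or opaque statement strings

def collect_loops(node, out):
    # Recursively gather every Loop node reachable from this one.
    if isinstance(node, Loop):
        out.append(node)
        for child in node.body:
            collect_loops(child, out)

def mutate_loop_nesting(root, rng=random):
    # Pick a random loop and wrap its body in a fresh inner loop,
    # deepening the nest the way a loop-nesting mutator would.
    loops = []
    collect_loops(root, loops)
    target = rng.choice(loops)
    inner = Loop(var=target.var + "_inner",
                 extent=rng.choice([2, 4, 8]),
                 body=target.body)
    target.body = [inner]
    return root

ir = Loop("i", 128, [Loop("j", 64, ["C[i][j] = A[i][j] + B[i][j]"])])
print(mutate_loop_nesting(ir))
```

A memory-operation or thread-binding mutator would follow the same shape: walk the IR, pick a node, and rewrite allocation indices or loop-to-thread bindings instead of the nesting depth.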
“…However, hand-crafted optimization is time-consuming in the long run and a fixed binary cannot meet the ultimate performance requirements for all hardware vendors. Therefore, to fundamentally resolve those challenges, recently DL infrastructures have been focusing on developing tensor compilers [Chen et al. 2018; Google 2016; Intel 2017; Jin et al. 2020; Rotem et al. 2018; Tillet et al. 2019; Zhao et al. 2021] to automatically generate best-in-class target code for different vendors or even architectures.…”
Section: Introduction
confidence: 99%
“…[Table: mapping approaches of existing tensor compilers]
AutoTVM [10]: Hand-written Templates + Tuning
Ansor [68]: Generation Rules + Tuning
UNIT [58]: Hand-written Templates
XLA [18]: Templates and Rules
ISA Mapper [52]: Templates and Rules + Tuning
Tiramisu [4]: Polyhedral Model
AKG [67]: Polyhedral Model + Templates
AMOS: Analyzable Abstraction + Tuning

The following two intrinsics are from Tensor Core WMMA: mma_sync is a matrix multiplication intrinsic (compute) and load_matrix_sync is a matrix load intrinsic (memory).…”
Section: Name
confidence: 99%
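As a rough illustration of what those two intrinsics compute, the following NumPy sketch models load_matrix_sync as copying one tile of a matrix into a per-warp fragment and mma_sync as one m16n16k16 tile multiply-accumulate. This is a semantic model only, not the CUDA nvcuda::wmma device API, and the 64x64 operand size is an arbitrary choice for the example.

```python
import numpy as np

M = N = K = 16  # the standard WMMA m16n16k16 tile shape

def load_matrix_sync(src, row, col, rows, cols):
    # Semantic model of the memory intrinsic: copy one tile into a fragment.
    return src[row:row + rows, col:col + cols].astype(np.float32)

def mma_sync(acc, a_frag, b_frag):
    # Semantic model of the compute intrinsic: acc += a_frag @ b_frag.
    return acc + a_frag @ b_frag

A = np.random.rand(64, 64).astype(np.float16)
B = np.random.rand(64, 64).astype(np.float16)
acc = np.zeros((M, N), dtype=np.float32)

# Produce the (0, 0) output tile by stepping over K in 16-wide slices:
# the load -> mma pipeline a tensor compiler must emit for Tensor Cores.
for k in range(0, 64, K):
    a_frag = load_matrix_sync(A, 0, k, M, K)
    b_frag = load_matrix_sync(B, k, 0, K, N)
    acc = mma_sync(acc, a_frag, b_frag)

ref = A.astype(np.float32) @ B.astype(np.float32)
assert np.allclose(acc, ref[:M, :N], atol=1e-3)
```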
“…For example, TVM [9] exposes a tensorize interface for users to configure their own intrinsics, and the users have to manually invoke intrinsics when implementing the software. Polyhedral compilers such as AKG [67] rely on a combination of the polyhedral model and templates to map software onto spatial accelerators. AutoTVM [10] and UNIT [58] use hand-tuned templates with intrinsics to support a narrow range of operators and accelerators.…”
Section: Existing Mapping Flow
confidence: 99%
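To illustrate the tensorize interface the excerpt refers to, below is a minimal sketch following the pattern of TVM's TE-schedule tensorize tutorial: the user declares a tensor intrinsic and manually marks which loop level it replaces. The extern function name vadd16 is a hypothetical stand-in for a real hardware intrinsic.

```python
import tvm
from tvm import te

def intrin_vadd(n):
    # Declare the computation the intrinsic implements: a length-n vector add.
    a = te.placeholder((n,), name="a")
    b = te.placeholder((n,), name="b")
    c = te.compute((n,), lambda i: a[i] + b[i], name="c")

    Ab = tvm.tir.decl_buffer(a.shape, a.dtype, name="Ab", offset_factor=1)
    Bb = tvm.tir.decl_buffer(b.shape, b.dtype, name="Bb", offset_factor=1)
    Cb = tvm.tir.decl_buffer(c.shape, c.dtype, name="Cb", offset_factor=1)

    def intrin_func(ins, outs):
        # Replace the matched loop body with a call to the (hypothetical)
        # hardware intrinsic "vadd16".
        ib = tvm.tir.ir_builder.create()
        aa, bb = ins
        cc = outs[0]
        ib.emit(tvm.tir.call_extern("int32", "vadd16",
                                    cc.access_ptr("w"),
                                    aa.access_ptr("r"),
                                    bb.access_ptr("r")))
        return ib.get()

    return te.decl_tensor_intrin(c.op, intrin_func, binds={a: Ab, b: Bb, c: Cb})

n = 1024
A = te.placeholder((n,), name="A")
B = te.placeholder((n,), name="B")
C = te.compute((n,), lambda i: A[i] + B[i], name="C")

s = te.create_schedule(C.op)
xo, xi = s[C].split(C.op.axis[0], factor=16)
# The user manually chooses the loop level that maps onto the intrinsic.
s[C].tensorize(xi, intrin_vadd(16))
print(tvm.lower(s, [A, B, C], simple_mode=True))
```

This manual marking step is exactly the burden the excerpt contrasts with polyhedral approaches such as AKG, which search for the mapping rather than requiring the user to invoke the intrinsic by hand.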