Systolic arrays have been widely used for accelerating HPC and deep learning applications. There is a plethora of prior work on the performance tuning of systolic arrays, but it typically relies on a number of oversimplified assumptions (e.g., considering only divisors for loop tiling, or pruning based on off-chip data communication) to reduce the design space. In this paper, we present Odyssey, a comprehensive design space exploration tool for systolic array optimization. Odyssey does not rely on artificial assumptions to limit the design space, yet it is highly efficient and scalable thanks to a hybrid optimization technique. For example, for a 1024×1024×1024 matrix multiplication, it finds designs that reach 90% of the optimal performance within 5 seconds on a single CPU thread. Moreover, using Odyssey, we unveil and quantify the suboptimality introduced by multiple oversimplifications commonly used in prior studies of systolic array design space exploration. For example, Odyssey shows that restricting loop tiling to divisors leads to a 39% performance loss, and that pruning based on off-chip data movement results in a 45% performance loss. We also applied Odyssey to explore the architectural trade-offs of matrix multiplication and convolutional neural networks, providing insights into possible optimizations for these two applications.
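To make the divisor-only tiling assumption concrete, the following is a minimal, hypothetical Python sketch (not taken from Odyssey) contrasting the tile-size candidates a divisor-restricted search considers for one loop bound against the full candidate set an unrestricted exploration would cover; the function names and the padding remark are illustrative assumptions, not the paper's implementation.

```python
# Hypothetical sketch: why divisor-only loop tiling shrinks the design space.
# For a loop bound n, the common pruning heuristic keeps only tile sizes that
# divide n exactly; an unrestricted search may pick any tile size from 1 to n
# (e.g., padding or handling a partial last tile).

def divisor_tiles(n: int) -> list[int]:
    """Tile sizes that divide n exactly (the divisor-only heuristic)."""
    return [t for t in range(1, n + 1) if n % t == 0]

def all_tiles(n: int) -> list[int]:
    """All tile sizes from 1 to n (full design space along one loop)."""
    return list(range(1, n + 1))

if __name__ == "__main__":
    n = 1024
    # 1024 = 2**10 has only 11 divisors, so the heuristic keeps
    # 11 candidates per loop versus 1024 in the full space.
    print(len(divisor_tiles(n)))  # 11
    print(len(all_tiles(n)))      # 1024
```

Under this (assumed) framing, the divisor restriction collapses each loop's tiling choices by roughly two orders of magnitude, which is consistent with how such pruning can exclude the best-performing configurations that the abstract quantifies.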