2019 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS) 2019
DOI: 10.1109/ispass.2019.00042
Timeloop: A Systematic Approach to DNN Accelerator Evaluation

Cited by 375 publications (296 citation statements); references 29 publications.
“…DimHW tiling is 6.5× faster to complete, because it only requires 128 memcpys of 16K elements to completely tile the input, compared to 262K memcpys of 8 elements. The effect of a different tiling strategy on the overall operation is harder to predict (but can be estimated with analytical models like Timeloop [45] or MAESTRO [26]). For element-wise operations, tiling strategy has next to no effect; for operations whose performance depends on exploiting data reuse, changing tiling shape may impact overall runtime.…”
Section: Tiling Optimizer
Confidence: 99%
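The memcpy arithmetic in the quote above can be checked directly: both strategies move the same 2M-element input, and the copy count is just total elements divided by transfer granularity. A minimal sketch (the function name and sizes are taken from the quoted numbers, not from the cited paper's code):

```python
# Both tiling strategies move the same input; they differ only in
# how many elements each memcpy transfers.
TOTAL_ELEMENTS = 128 * 16 * 1024  # 2,097,152 elements, per the quoted figures

def memcpy_count(elements_per_copy):
    """Number of memcpy calls needed to tile the whole input."""
    return TOTAL_ELEMENTS // elements_per_copy

dim_hw = memcpy_count(16 * 1024)  # DimHW tiling: large contiguous copies
fine = memcpy_count(8)            # fine-grained tiling: tiny copies

print(dim_hw)  # 128
print(fine)    # 262144
```

This confirms the quote's 128-vs-262K comparison: fewer, larger transfers amortize per-copy overhead, which is where the reported 6.5× speedup comes from.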
“…Some are end-to-end systems, like TensorFlow [1] or TVM [9], but they either lack simulation support or require detailed pipeline models or RTL. Other tools focus on exploring dataflows and efficiently mapping DNN kernels to FPGAs or ASICs [45,61,65,72,76,77]. These often implement a component library or templated designs for hardware optimization, but with a heavy focus on optimizing the accelerator, they cannot evaluate networks end-to-end, leaving many design opportunities unexplored.…”
Section: Related Work
Confidence: 99%
“…For designing FPGA-based DNN accelerators, current practice usually relies on roofline models [10] or customized analytical tools [13,16] to estimate the achievable performance. For ASIC-based accelerators, recently published designs [21,34,35] introduce various performance prediction methods. Eyeriss [21] proposes an energy model for capturing the energy overhead of the customized memory and computation units and a delay model that simplifies the latency calculation.…”
Section: Background and Related Work
Confidence: 99%
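The Eyeriss-style energy model mentioned above reduces to a weighted sum: access counts per hardware component times a per-access energy cost. A minimal sketch, with illustrative placeholder costs (the component names and numbers below are assumptions, not values from Eyeriss):

```python
# Hypothetical per-access energy costs in relative units; real models
# calibrate these per memory level and compute unit.
ENERGY_PER_ACCESS = {
    "MAC": 1.0,
    "reg_file": 1.0,
    "noc": 2.0,
    "global_buf": 6.0,
    "DRAM": 200.0,  # off-chip access dominates, as such models typically show
}

def total_energy(access_counts):
    """Sum access counts weighted by per-access energy cost."""
    return sum(access_counts[c] * ENERGY_PER_ACCESS[c] for c in access_counts)

accesses = {"MAC": 1_000, "reg_file": 2_000, "noc": 300,
            "global_buf": 100, "DRAM": 50}
print(total_energy(accesses))  # 14200.0
```

The design point such models capture is that a handful of DRAM accesses can outweigh thousands of register-file accesses, which is why dataflow choices that maximize on-chip reuse matter.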
“…However, the roofline model lacks fine-grained estimation, and customized models are not as general as desired. Timeloop [21] and Eyeriss [22] use for and parallel-for to describe the temporal and spatial mapping of DNN accelerators. Specifically, Timeloop obtains the number of memory accesses and estimates the latency by calculating the maximum isolated execution cycle across all hardware IPs based on a double-buffering assumption.…”
Section: Introduction
Confidence: 99%
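The latency estimate described in the quote above can be sketched in a few lines: under double buffering, each hardware level (compute, on-chip buffer, DRAM) operates concurrently, so overall latency is the maximum isolated cycle count across levels. The level names and throughput numbers here are illustrative assumptions, not Timeloop's actual model parameters:

```python
def isolated_cycles(accesses, width_per_cycle):
    """Cycles a level would need if it ran alone (ceiling division)."""
    return -(-accesses // width_per_cycle)

# Hypothetical per-level access counts and per-cycle widths.
levels = {
    "MACs": isolated_cycles(1_000_000, 256),  # compute array
    "buffer": isolated_cycles(400_000, 128),  # on-chip SRAM
    "DRAM": isolated_cycles(120_000, 16),     # off-chip memory
}

# With double buffering, levels overlap; the slowest one sets the latency.
latency = max(levels.values())
print(latency)  # 7500 -> DRAM is the bottleneck here
```

In this toy configuration the off-chip level dominates despite moving far fewer elements, illustrating why such models steer mappings toward reducing DRAM traffic.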