2021 ACM/IEEE 48th Annual International Symposium on Computer Architecture (ISCA)
DOI: 10.1109/isca52012.2021.00021
RaPiD: AI Accelerator for Ultra-low Precision Training and Inference

Cited by 54 publications (15 citation statements)
References 51 publications
“…Compared to [10], [31], FlexBlock allows more fine-grained blocking of sub-tensors to support variable precisions for accelerating training, as discussed in Section VII-B. In very recent work on low-precision training [1], [47], [58], [65], [67], 8-bit floating point (FP8 or HFP8) has been used to train DNNs with little accuracy loss across a wide spectrum of benchmarks. However, the hardware associated with FP8 training fixes specific mantissa and exponent widths for maximum energy efficiency, which limits flexibility.…”
Section: B Reduced Precision During Dnn Trainingmentioning
confidence: 99%
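The trade-off the excerpt describes hinges on how an FP8 format splits its 8 bits between exponent and mantissa. The sketch below simulates rounding a value into a simple FP8-like format with configurable widths; the function name and parameters are illustrative, not from the paper, and the format is a plain IEEE-style layout (no NaN/Inf encodings, so e.g. a 4-bit-exponent/3-bit-mantissa format tops out at 240 rather than the 448 of reclaimed-encoding variants).

```python
import math

def quantize_fp8(x, exp_bits=4, man_bits=3, bias=None):
    """Round x to the nearest value representable in a simple FP8-like
    format with the given exponent/mantissa widths (illustrative sketch;
    no NaN/Inf handling, IEEE-style bias and subnormals)."""
    if x == 0.0:
        return 0.0
    if bias is None:
        bias = 2 ** (exp_bits - 1) - 1          # standard IEEE-style bias
    sign = -1.0 if x < 0 else 1.0
    mag = abs(x)
    # Exponent of the leading bit, clamped to the representable range.
    e = math.floor(math.log2(mag))
    e_min = 1 - bias                            # smallest normal exponent
    e_max = 2 ** exp_bits - 2 - bias            # largest normal exponent
    e = max(min(e, e_max), e_min)
    # Round the significand to man_bits fractional bits at this exponent;
    # clamping e to e_min makes subnormal values fall out naturally.
    scale = 2.0 ** (man_bits - e)
    q = round(mag * scale) / scale
    # Clamp overflow from rounding past the largest normal value.
    max_val = (2 - 2 ** -man_bits) * 2.0 ** e_max
    return sign * min(q, max_val)
```

Varying `exp_bits`/`man_bits` (e.g. 4/3 versus 5/2) shows concretely why fixing one split in hardware maximizes efficiency for one precision profile but cannot serve workloads that need the other.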
“…High-Performance Application-Specific Accelerators. The methodology proposed in this work through BiSon-e offers even more flexibility than high-performance application-specific accelerators such as [14, 48]. For example, [14] represents a state-of-the-art DNN accelerator for mobile devices, featuring 192 processing elements and line buffers for a total area of 36 mm² in the TSMC 65 nm technology node.…”
Section: Related Workmentioning
confidence: 99%
“…A spatial compute array is the key component in many popular low-cost CNN accelerators [50, 58, 97, 113–123].…”
Section: Spatial Architectures For Cnn Inferencementioning
confidence: 99%