A 3.0 TFLOPS 0.62V Scalable Processor Core for High Compute Utilization AI Training and Inference

Oh, Jinwook; Lee, Sae Kyu; Kang, Mingu; Ziegler, Matthew M.; Silberman, J. A.; Agrawal, Ankur; Venkataramani, Swagath; Fleischer, Bruce; Guillorn, Michael A.; Choi, Jungwook; Wang, Wei; Mueller, Silvia Melitta; Ben-Yehuda, Shimon; Bonanno, James; Cao, Nianwen; Casatuta, Robert; Chen, Chia-Yu; Cohen, Matt; Erez, Ophir; Fox, Thomas W.; Gristede, George; Haynie, Howard; Ivanov, V. O.; Koswatta, Siyu; Lo, Shih-Hsien; Lutz, Martin; Maier, Gary; Mesh, Alex; Nustov, Yevgeny; Rider, Scot; Schaal, Marcel; Scheuermann, M.; Sun, Xiao; Wang, Naigang; Yee, Fanchieh; Zhou, Ching; Shah, Vinay; Curran, Brian; Srinivasan, Vijayalakshmi; Lu, Pong‐Fei; Shukla, Sunil; Gopalakrishnan, Kailash; Chang, Leland

doi:10.1109/vlsicircuits18222.2020.9162917

Cited by 30 publications

(6 citation statements)

References 0 publications


“…The DNN training may not be the only answer for AI to reach human intelligence, but it will lead to the harmonious coexistence of AI and human beings. [3,4,10] • Mixed-mode Computing [5,8,9] • Stochastic Rounding Circuit [29] • Binary BW Computing MAC [34] • FP-FXP Fused Multiply-Add Unit [16,18,19,21,22,24,26,31] PE/Circuit Level Feature…”

Section: Discussion (mentioning)

confidence: 99%

“…[29] [26,31,51,78] Sparsity Exploitation [17][18][19] was proposed to unify the data representation method of both input operand and accumulation. Flexpoint [55] tried to substitute FP with FXP representation using a shared exponent management algorithm together for simplification of MAC design, but it failed to reduce the required bit-precision to less than 16-bit.…”

Section: A New Number Representation (mentioning)

confidence: 99%

“…Training processors reported from the industry [14][15][16][17][18][19] mainly focused on general-purpose DNN training. On the contrary, training processors from the academy [20][21][22][23][24][25][26][27][28][29][30][31][32][33][34][35] mainly targeted local training which fine-tunes DNN to be more accurate in user-specific datasets.…”

Section: A Applications and Examples Of Training Processor 1) Applica... (mentioning)

confidence: 99%

“…The majority of processors [14][15][16][17][18][19][20][21][22][23][24][25][26][27][28][29][30][31][32] adopted the homogeneous core design and they were programmable to be applied in various types of networks and applications. Moreover, they also emphasized a new FP-number-representation-based PE design by utilizing a new number representation method and adopted precision-configurable MAC for energy-efficient inference and training [15-19, 21, 22, 24, 26].…”

Section: ) Examples Of Training Processor Design (mentioning)

confidence: 99%


Energy-Efficient DNN Training Processors on Micro-AI Systems

Han, Kang, Kim, et al. 2022

IEEE Open J. Solid-State Circuits Soc.


Many edge/mobile devices are now able to utilize deep neural networks (DNNs) thanks to the development of mobile DNN accelerators. Mobile DNN accelerators overcame the problems of limited computing resources and battery capacity by realizing energy-efficient inference. However, its passive behavior makes it difficult for DNN to provide active customization for individual users or its service environment. The importance of on-chip training is rising more and more to provide active interaction between DNN processors and ever-changing surroundings or conditions. Despite its advantages, the DNN training has more constraints than the inference such that it was considered impractical to be realized on mobile/edge devices. Recently, there are many trials to realize mobile DNN training, and a number of prior works will be summarized. Firstly, it arranges the new challenges of the DNN accelerator induced by training functionality and discusses new hardware features related to the challenges. Secondly, it explains algorithm-hardware co-optimization methods and explains why it becomes mainstream in mobile DNN training research. Thirdly, it compares the main differences between the conventional inference accelerators and recent training processors. Finally, the conclusion is made by proposing the future directions of the DNN training processor in micro-AI systems.


“…Each compute array is specialized for certain type of AI operations, allowing for higher circuit customization and density as well as lower latency and power since both compute arrays may not be in use. It exploits a systolic dataflow architecture similar to designs in [4,33,60,65,92]. Discrete synchronization hardware and micro-instructions in the various engines allow for synchronization of the operations within the accelerator and with the general purpose core that initiates the execution of NNPA instructions.…”

Section: Compute (mentioning)

confidence: 99%

AI accelerator on IBM Telum processor

Lichtenau, Buyuktosunoglu, Bertran, et al. 2022

Proceedings of the 49th Annual International Symposium on Computer Architecture


IBM Telum is the next generation processor chip for IBM Z and LinuxONE systems. The Telum design is focused on enterprise class workloads and it achieves over 40% per socket performance growth compared to IBM z15. The IBM Telum is the first server-class chip with a dedicated on-chip AI accelerator that enables clients to gain real time insights from their data as it is getting processed. Seamlessly infusing AI in all enterprise workloads is highly desirable to get real business insight on every transaction as well as to improve IT operation, security, and data privacy. While it would undeniably provide significant additional value, its application in practice is often accompanied by hurdles from low throughput if run on-platform to security concerns and inconsistent latency if run off-platform. The IBM Telum chip introduces an on-chip AI accelerator that provides consistent low latency and high throughput (over 200 TFLOPS in 32 chip system) inference capacity usable by all threads. The accelerator is memory coherent and directly connected to the fabric like any other general-purpose core to support low latency inference while meeting the system's transaction rate. A scalable architecture providing transparent access to AI accelerator functions via a non-privileged general-purpose core

* This paper is part of the Industry Track of ISCA 2022's program.


HNPU-V1: An Adaptive DNN Training Processor Utilizing Stochastic Dynamic Fixed-Point and Active Bit-Precision Searching

Han, Yoo 2023

On-Chip Training NPU - Algorithm, Architecture and SoC Design
