2019 IEEE International Solid-State Circuits Conference (ISSCC)
DOI: 10.1109/isscc.2019.8662302

7.7 LNPU: A 25.3TFLOPS/W Sparse Deep-Neural-Network Learning Processor with Fine-Grained Mixed Precision of FP8-FP16

Abstract: The massive computational costs associated with large language model (LLM) pretraining have spurred great interest in reduced-precision floating-point representations to accelerate the process. As a result, the BrainFloat16 (BF16) precision has become the de facto standard for LLM training, with hardware support included in recent accelerators. This trend has gone even further in the latest processors, where FP8 has recently been introduced. However, prior experience with FP16, which was found to be less stable…
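To make the mixed-precision motif in the title concrete, here is a minimal sketch assuming a crude model of an E4M3-style FP8 format (3-bit mantissa, saturation at the E4M3 maximum finite value of 448, flush-to-zero below 2^-6; subnormals, rounding modes, and NaNs ignored). It illustrates why a narrow format loses small gradient values; it is not a model of the LNPU datapath, and the function name and constants are assumptions.

```python
import numpy as np

def quantize_fp8_e4m3(x):
    """Crude E4M3-style quantizer (illustrative assumption, not the
    paper's hardware): keep ~4 significant mantissa bits, saturate at
    448, and flush magnitudes below 2**-6 to zero."""
    x = np.asarray(x, dtype=np.float32)
    m, e = np.frexp(x)                   # x = m * 2**e with m in [0.5, 1)
    m = np.round(m * 16.0) / 16.0        # round mantissa onto the coarse grid
    y = np.ldexp(m, e)
    y = np.where(np.abs(y) < 2.0**-6, 0.0, y)  # simplified underflow
    return np.clip(y, -448.0, 448.0)           # simplified saturation

grads = np.array([1.7e-2, 3.3e-4, 2.5e-6], dtype=np.float32)
print(quantize_fp8_e4m3(grads))   # small gradients flush to zero in FP8
print(grads.astype(np.float16))   # FP16 still resolves all three values
```

A fine-grained mixed-precision scheme exploits exactly this gap: values the narrow format represents well stay in FP8, while outliers fall back to FP16. The selection policy LNPU actually uses is described in the paper, not in this sketch.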

Cited by 129 publications (60 citation statements) · References 15 publications

Citation statements (ordered by relevance):
“…Computationally, this is motivated by biology where memory and compute are interleaved and global movement of data is minimal. New neuromorphic architectures for deep learning (Shin et al, 2017; Lee et al, 2018, 2019) and reinforcement learning (Amaravati et al, 2018a,b; Cao et al, 2019; Kim et al, 2019) seek to apply this constraint to avoid the communication overhead. However, as we have noted earlier, BP violates this constraint.…”
Section: Results
confidence: 99%
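As a toy illustration of the locality constraint in the quote above (shapes and data invented, unrelated to any of the cited architectures): the backward pass of backpropagation reuses the forward weight matrix as its transpose, so the gradient at a layer's input cannot be computed from information local to that layer's output alone, the so-called weight transport problem.

```python
import numpy as np

rng = np.random.default_rng(1)
W = rng.normal(size=(3, 4)).astype(np.float32)  # forward weights
x = rng.normal(size=(4,)).astype(np.float32)

y = W @ x                              # forward: W is used where it is stored
grad_y = np.ones(3, dtype=np.float32)  # dummy upstream gradient
grad_x = W.T @ grad_y                  # backward: the SAME W must be moved
                                       # back across the layer, violating the
                                       # interleaved memory/compute ideal
print(grad_x)
```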
“…Although developed for general DNNs, the accelerators shown in Table II can efficiently realize portable smart DL-based healthcare IoT and PoC systems for processing image-based (medical imaging) or dynamic sequential medical data types (such as EEG and ECG). For instance, the table shows a few exemplar healthcare and biomedical applications that are picked based on the demonstrated capacity of these accelerators to run (or train [55]) various well-known CNN architectures such as VGG, ResNet, MobileNet, AlexNet, Inception, or RNNs such as LSTMs, or combined CNN-RNNs. It is worth noting that most of the available accelerators are intended for CNN inference, while only some [56]- [58] also include recurrent connections for RNN acceleration.…”
Section: Edge-AI DNN Accelerators Suitable for Biomedical Applications
confidence: 99%
“…Neural processor [70] is another CNN accelerator that is shown to be able to run Inception V3 CNN, which can be used for skin cancer detection [11] at the edge. LNPU [55] is the only CNN accelerator shown in Table II, which unlike the others can perform both learning and inference of a deep network such as AlexNet and VGG-16, for applications including on edge medical imaging [32] and cancer diagnosis [62].…”
Section: Edge-AI DNN Accelerators Suitable for Biomedical Applications
confidence: 99%
“…Even BNNs that use binarized weights during inference require floating-point computations for their training [6]. Several studies on training hardware using the backpropagation algorithm have recently been reported [10]- [12]. Training using a hardware accelerator [10], [11] shows insufficient improvement in terms of latency compared to the improvement obtained for inference carried out on the same accelerator.…”
Section: Introduction
confidence: 99%
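The quoted point, that binarized inference still demands floating-point training, can be sketched with a straight-through-estimator (STE) update, a common BNN training recipe used here purely as an illustration; the layer sizes, squared-error loss, and learning rate are invented.

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(0.0, 0.1, size=(4, 3)).astype(np.float32)  # float master weights
x = rng.normal(size=(1, 4)).astype(np.float32)
target = np.array([[0.0, 1.0, 0.0]], dtype=np.float32)
lr = 0.1

for step in range(5):
    wb = np.sign(w)               # forward pass sees only +/-1 weights
    y = x @ wb                    # inference-style binary compute
    grad_y = 2.0 * (y - target)   # dL/dy for the squared-error loss
    grad_w = x.T @ grad_y         # STE: treat sign() as identity in backward
    w -= lr * grad_w              # the update itself stays floating point

print(np.sign(w))                 # deployable binary weights after training
```

Only the forward products are binary; the master weights and their updates live in floating point, which is what dedicated training hardware has to support.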