2019 IEEE International Solid-State Circuits Conference (ISSCC)
DOI: 10.1109/isscc.2019.8662302

7.7 LNPU: A 25.3TFLOPS/W Sparse Deep-Neural-Network Learning Processor with Fine-Grained Mixed Precision of FP8-FP16

Abstract: The massive computational costs associated with large language model (LLM) pretraining have spurred great interest in reduced-precision floating-point representations to accelerate the process. As a result, the BrainFloat16 (BF16) precision has become the de facto standard for LLM training, with hardware support included in recent accelerators. This trend has gone even further in the latest processors, where FP8 has recently been introduced. However, prior experience with FP16, which was found to be less stable…
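To make the mixed-precision motif in the title concrete, here is a minimal sketch assuming a crude model of an E4M3-style FP8 format (3-bit mantissa, saturation at the E4M3 maximum finite value of 448, flush-to-zero below 2^-6; subnormals, rounding modes, and NaNs ignored). It illustrates why a narrow format loses small gradient values; it is not a model of the LNPU datapath, and the function name and constants are assumptions.

```python
import numpy as np

def quantize_fp8_e4m3(x):
    """Crude E4M3-style quantizer (illustrative assumption, not the
    paper's hardware): keep ~4 significant mantissa bits, saturate at
    448, and flush magnitudes below 2**-6 to zero."""
    x = np.asarray(x, dtype=np.float32)
    m, e = np.frexp(x)                   # x = m * 2**e with m in [0.5, 1)
    m = np.round(m * 16.0) / 16.0        # round mantissa onto the coarse grid
    y = np.ldexp(m, e)
    y = np.where(np.abs(y) < 2.0**-6, 0.0, y)  # simplified underflow
    return np.clip(y, -448.0, 448.0)           # simplified saturation

grads = np.array([1.7e-2, 3.3e-4, 2.5e-6], dtype=np.float32)
print(quantize_fp8_e4m3(grads))   # small gradients flush to zero in FP8
print(grads.astype(np.float16))   # FP16 still resolves all three values
```

A fine-grained mixed-precision scheme exploits exactly this gap: values the narrow format represents well stay in FP8, while outliers fall back to FP16. The selection policy LNPU actually uses is described in the paper, not in this sketch.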

Cited by 129 publications (60 citation statements) · References 15 publications

Citation statements (ordered by relevance):
“…Computationally, this is motivated by biology where memory and compute are interleaved and global movement of data is minimal. New neuromorphic architectures for deep learning (Shin et al, 2017; Lee et al, 2018, 2019) and reinforcement learning (Amaravati et al, 2018a,b; Cao et al, 2019; Kim et al, 2019) seek to apply this constraint to avoid the communication overhead. However, as we have noted earlier, BP violates this constraint.…”
Section: Results
confidence: 99%
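As a toy illustration of the locality constraint in the quote above (shapes and data invented, unrelated to any of the cited architectures): the backward pass of backpropagation reuses the forward weight matrix as its transpose, so the gradient at a layer's input cannot be computed from information local to that layer's output alone, the so-called weight transport problem.

```python
import numpy as np

rng = np.random.default_rng(1)
W = rng.normal(size=(3, 4)).astype(np.float32)  # forward weights
x = rng.normal(size=(4,)).astype(np.float32)

y = W @ x                              # forward: W is used where it is stored
grad_y = np.ones(3, dtype=np.float32)  # dummy upstream gradient
grad_x = W.T @ grad_y                  # backward: the SAME W must be moved
                                       # back across the layer, violating the
                                       # interleaved memory/compute ideal
print(grad_x)
```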
“…Although developed for general DNNs, the accelerators shown in Table II can efficiently realize portable smart DL-based healthcare IoT and PoC systems for processing image-based (medical imaging) or dynamic sequential medical data types (such as EEG and ECG). For instance, the table shows a few exemplar healthcare and biomedical applications that are picked based on the demonstrated capacity of these accelerators to run (or train [55]) various well-known CNN architectures such as VGG, ResNet, MobileNet, AlexNet, Inception, or RNNs such as LSTMs, or combined CNN-RNNs. It is worth noting that most of the available accelerators are intended for CNN inference, while only some [56]- [58] also include recurrent connections for RNN acceleration.…”
Section: Edge-AI DNN Accelerators Suitable for Biomedical Applications
confidence: 99%
“…Neural processor [70] is another CNN accelerator that is shown to be able to run Inception V3 CNN, which can be used for skin cancer detection [11] at the edge. LNPU [55] is the only CNN accelerator shown in Table II, which unlike the others can perform both learning and inference of a deep network such as AlexNet and VGG-16, for applications including on edge medical imaging [32] and cancer diagnosis [62].…”
Section: Edge-AI DNN Accelerators Suitable for Biomedical Applications
confidence: 99%
“…Even BNNs that use binarized weights during inference require floating-point computations for their training [6]. Several studies on training hardware using the backpropagation algorithm have recently been reported [10]- [12]. Training using a hardware accelerator [10], [11] shows insufficient improvement in terms of latency compared to the improvement obtained for inference carried out on the same accelerator.…”
Section: Introduction
confidence: 99%
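The quoted point, that binarized inference still demands floating-point training, can be sketched with a straight-through-estimator (STE) update, a common BNN training recipe used here purely as an illustration; the layer sizes, squared-error loss, and learning rate are invented.

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(0.0, 0.1, size=(4, 3)).astype(np.float32)  # float master weights
x = rng.normal(size=(1, 4)).astype(np.float32)
target = np.array([[0.0, 1.0, 0.0]], dtype=np.float32)
lr = 0.1

for step in range(5):
    wb = np.sign(w)               # forward pass sees only +/-1 weights
    y = x @ wb                    # inference-style binary compute
    grad_y = 2.0 * (y - target)   # dL/dy for the squared-error loss
    grad_w = x.T @ grad_y         # STE: treat sign() as identity in backward
    w -= lr * grad_w              # the update itself stays floating point

print(np.sign(w))                 # deployable binary weights after training
```

Only the forward products are binary; the master weights and their updates live in floating point, which is what dedicated training hardware has to support.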