2022
DOI: 10.48550/arxiv.2203.06673
Preprint

FlexBlock: A Flexible DNN Training Accelerator with Multi-Mode Block Floating Point Support

Abstract: Training deep neural networks (DNNs) is a computationally expensive job, which can take weeks or months even with high-performance GPUs. As a remedy for this challenge, the community has started exploring the use of more efficient data representations in the training process, e.g., block floating point (BFP). However, prior work on BFP-based DNN accelerators relies on a specific BFP representation, making it less versatile. This paper builds upon an algorithmic observation that we can accelerate the training by lev…
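To make the abstract's central idea concrete, the following is a minimal sketch of block floating point quantization, assuming a toy format in which each block shares one power-of-two exponent and each value keeps only a signed integer mantissa; the function name bfp_quantize and the block_size/mantissa_bits parameters are illustrative placeholders, not the paper's multi-mode configurations.

```python
import numpy as np

def bfp_quantize(x, block_size=16, mantissa_bits=8):
    """Round-trip a 1-D array through a toy BFP format: each block of
    `block_size` values shares one power-of-two exponent, and each value
    keeps only a signed `mantissa_bits`-bit mantissa."""
    x = np.asarray(x, dtype=np.float64)
    pad = (-x.size) % block_size
    blocks = np.pad(x, (0, pad)).reshape(-1, block_size)

    # Shared exponent: smallest power of two covering the block's largest magnitude.
    max_abs = np.max(np.abs(blocks), axis=1, keepdims=True)
    safe_max = np.where(max_abs > 0, max_abs, 1.0)
    shared_exp = np.ceil(np.log2(safe_max))

    # Scale so mantissas fit in the signed range, round, then clip the edge case
    # where the block maximum is exactly a power of two.
    scale = 2.0 ** (shared_exp - (mantissa_bits - 1))
    mantissa = np.clip(np.round(blocks / scale),
                       -(2 ** (mantissa_bits - 1)),
                       2 ** (mantissa_bits - 1) - 1)

    # Dequantize so the error versus the original values can be inspected.
    return (mantissa * scale).reshape(-1)[: x.size]

vals = np.random.randn(64)
print(np.max(np.abs(vals - bfp_quantize(vals))))  # worst-case quantization error
```

The sketch only illustrates why a shared per-block exponent reduces storage and multiplier width relative to full floating point; the paper's contribution (multi-mode BFP support in hardware) is not reproduced here.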

Cited by 1 publication (2 citation statements)
References 44 publications
“…The prior work on designing energy-efficient DNN accelerators mostly focuses on Conv/FC operations [26], [27], [34], [36], while there is a lack of research on making the BN hardware more efficient. One of the most effective ways of improving the hardware efficiency of a processing unit is reducing the bit-precision.…”
Section: A. Compute Units for BN Layers (mentioning)
confidence: 99%
“…With mixed-precision training, multiplications are performed in FP16 while accumulations are performed in FP32. Unconventional data representations suited to DNN training, such as bfloat16 [13], [19] and the block floating point (BFP) representation [8], [26], have been studied as well. To enable DNN training at much lower hardware cost, possibly at the edge, researchers have explored low-precision training using FP8 (i.e., 8-bit floating point) with the support of squeeze-and-shift operations [4] or exponent biases [9] to cover the wide dynamic range of the original data distribution.…”
Section: Introduction (mentioning)
confidence: 99%
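As a hedged illustration of the mixed-precision pattern quoted above (FP16 multiplies feeding an FP32 accumulator), the NumPy sketch below emulates the numerics only; it is not the cited accelerator's datapath, and mixed_precision_dot is a hypothetical helper name.

```python
import numpy as np

def mixed_precision_dot(a, b):
    """Dot product with FP16 multiplies and an FP32 accumulator."""
    a16 = np.asarray(a, dtype=np.float16)
    b16 = np.asarray(b, dtype=np.float16)
    acc = np.float32(0.0)
    for x, y in zip(a16, b16):
        prod = x * y                 # product formed at FP16 precision
        acc = acc + np.float32(prod) # accumulation carried in FP32
    return acc

a = np.random.randn(1024)
b = np.random.randn(1024)
print(mixed_precision_dot(a, b))     # mixed-precision result
print(np.float32(np.dot(a, b)))      # higher-precision reference, cast for comparison
```

Keeping the accumulator in FP32 avoids the error growth that pure FP16 accumulation would introduce over long reductions, which is the reason the quoted scheme splits the two precisions.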