HNPU: An Adaptive DNN Training Processor Utilizing Stochastic Dynamic Fixed-Point and Active Bit-Precision Searching

Han, Donghyeon; Im, Dongseok; Park, Gwangtae; Kim, Youngwoo; Song, Seokchan; Lee, Juhyoung; Yoo, Hoi‐Jun

doi:10.1109/jssc.2021.3066400

Cited by 42 publications

(32 citation statements)

References 18 publications

“…Zero bit-slice skipping architecture [6] computes the bit-slice data and skips the zero input bit-slices. Therefore, it can skip more zero computations by skipping the additional redundant computation caused by zero bit-slices.…”

Section: B Sparsity Exploiting Architecture (mentioning)

confidence: 99%

“…Conventional bit-slice representation [6], [22] decomposes 2's complement fixed-point data to an MSB bit-slice which is a signed bit-slice and the lower slices which are unsigned bit-slices. In this work, the SBR adds the sign bit to each unsigned bit-slice to produce the signed bit-slice, and it adds 1 value by borrowing from its lower bit-slice if the data is negative value as shown in Fig.…”

Section: A Signed Bit-slice Representation and Its Encoding Unit (mentioning)

confidence: 99%
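
The two encodings contrasted in this statement can be sketched in a few lines. This is a minimal illustration, not the cited paper's encoder: the function names, the 3-bit magnitude chunks, and the handling of the lowest slice (which has nothing below it to borrow from) are assumptions made so that the slices always sum back to the original value.

```python
def conventional_slices(x, num_slices, slice_bits=4):
    """Split a 2's-complement integer into slices, low-order first.

    Lower slices are unsigned; only the top (MSB) slice is signed.
    """
    width = num_slices * slice_bits
    assert -(1 << (width - 1)) <= x < (1 << (width - 1))
    raw = x & ((1 << width) - 1)              # 2's-complement bit pattern
    mask = (1 << slice_bits) - 1
    slices = [(raw >> (i * slice_bits)) & mask for i in range(num_slices)]
    if slices[-1] >= 1 << (slice_bits - 1):   # reinterpret the MSB slice as signed
        slices[-1] -= 1 << slice_bits
    return slices


def signed_slices(x, num_slices, mag_bits=3):
    """Hypothetical SBR-style encoder: every slice is a signed (mag_bits + 1)-bit value.

    The data word is assumed to be one sign bit plus num_slices * mag_bits bits.
    """
    width = num_slices * mag_bits
    assert -(1 << width) <= x < (1 << width)
    raw = x & ((1 << width) - 1)              # bits below the sign bit
    mask = (1 << mag_bits) - 1
    chunks = [(raw >> (i * mag_bits)) & mask for i in range(num_slices)]
    if x >= 0:
        return chunks                         # sign bit 0: slices stay non-negative
    # Prepending sign bit 1 turns each chunk into (chunk - 2**mag_bits).
    # Every slice except the lowest then gains +1, borrowed from the slice below.
    slices = [c - (1 << mag_bits) for c in chunks]
    return [slices[0]] + [s + 1 for s in slices[1:]]


def reconstruct(slices, shift_bits):
    """Recombine low-order-first slices into the integer they encode."""
    return sum(s << (i * shift_bits) for i, s in enumerate(slices))
```

Under these assumptions, `conventional_slices(-1, 2)` gives `[15, -1]` (no zero slice), while `signed_slices(-1, 2)` gives `[-1, 0]`: the high-order slice becomes zero and can be skipped.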

“…As a result, the signed MAC unit can achieve high efficiency by reducing the bit-width of a multiplier and output accumulation register compared to the conventional MAC unit in the bit-slice architecture. For example, the previous work [6] uses a 5b×5b MAC unit with sign extension to compute 4-bit, 8-bit, 12-bit, and 16-bit precision data with its best MAC efficiency. On the other hand, the 5b×5b signed MAC unit can support 5-bit, 9-bit, 13-bit, and 17-bit precision data which shows a higher MAC efficiency than the previous work.…”

Section: B Signed Mac Unit (mentioning)

confidence: 99%
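
To make the role of a 5b×5b multiplier concrete, the sketch below multiplies two 16-bit values using only products of 4-bit slices. It is an illustration under assumptions, not the cited hardware: the function names are hypothetical, and the reading that unsigned slices (0 to 15) need a fifth bit before entering a signed multiplier is an interpretation of "sign extension" in the statement above.

```python
def to_slices(x, num_slices, slice_bits=4):
    """Conventional slicing, low-order first: unsigned lower slices, signed MSB slice."""
    width = num_slices * slice_bits
    assert -(1 << (width - 1)) <= x < (1 << (width - 1))
    raw = x & ((1 << width) - 1)
    mask = (1 << slice_bits) - 1
    slices = [(raw >> (i * slice_bits)) & mask for i in range(num_slices)]
    if slices[-1] >= 1 << (slice_bits - 1):
        slices[-1] -= 1 << slice_bits
    return slices


def sliced_multiply(x, y, num_slices=4, slice_bits=4):
    """Compute x * y as a sum of shifted slice-by-slice partial products."""
    xs = to_slices(x, num_slices, slice_bits)
    ys = to_slices(y, num_slices, slice_bits)
    # Range of a signed (slice_bits + 1)-bit operand, i.e. 5 bits when slice_bits is 4.
    lo, hi = -(1 << slice_bits), (1 << slice_bits) - 1
    total = 0
    for i, a in enumerate(xs):
        for j, b in enumerate(ys):
            assert lo <= a <= hi and lo <= b <= hi   # each operand fits the small multiplier
            total += (a * b) << ((i + j) * slice_bits)
    return total
```

With one to four slices per operand this covers 4-bit, 8-bit, 12-bit, and 16-bit data, matching the precisions listed for the conventional unit; for example, `sliced_multiply(-1234, 567)` equals `-1234 * 567`.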

“…It lessens the memory bandwidth and on-chip memory footprint due to low bit-width of data, and it increases the computing throughput by integrating a large number of low bit multiplier-and-accumulate (MAC) units. Therefore, bit-slice accelerators [6], [12], [16], [21], [22] accelerate various bit-precision of DNNs with numerous number of low bit MAC units by dynamically matching the bit-width in a spatial- and time-multiplexing method. However, accuracy-sensitive tasks such as image super-resolution [3] and monocular depth estimation [5] require higher precision than object classification [19], [25] and object detection [18], [20] tasks.…”

Section: Introduction (mentioning)

confidence: 99%

“…Furthermore, after decomposing the 2's complement of fixed-point data to bit-slices, additional sparsity is occurred in a small positive value of data whose high order of bit-slices are zeros. Therefore, zero bit-slice skipping architecture [6] takes the advantages of both low bit computing and zero bit-slice skipping, which increases the energy-efficiency even in high bit-precision DNNs.…”

Section: Introduction (mentioning)

confidence: 99%
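
The extra sparsity described here is easy to see by counting zero slices in a 2's-complement word. The following is a minimal sketch; the function names and the 16-bit width with 4-bit slices are assumptions chosen for illustration, and the code only counts skippable slices rather than modelling any skipping hardware.

```python
def bit_slices(x, width=16, slice_bits=4):
    """Return the raw slices of x's 2's-complement bit pattern, low-order first."""
    assert width % slice_bits == 0
    assert -(1 << (width - 1)) <= x < (1 << (width - 1))
    raw = x & ((1 << width) - 1)
    mask = (1 << slice_bits) - 1
    return [(raw >> s) & mask for s in range(0, width, slice_bits)]


def zero_slice_count(x, width=16, slice_bits=4):
    """Count the slices a zero bit-slice skipping scheme could skip for x."""
    return sum(1 for s in bit_slices(x, width, slice_bits) if s == 0)
```

For instance, `zero_slice_count(5)` is 3 because the three high-order slices are zero, whereas `zero_slice_count(-5)` is 0 because a small negative value fills its high-order slices with ones.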

Energy-efficient Dense DNN Acceleration with Signed Bit-slice Architecture

Im, Park, Li et al. 2022

Preprint

Self Cite

As the number of deep neural networks (DNNs) to be executed on a mobile system-on-chip (SoC) increases, the mobile SoC suffers from the real-time DNN acceleration within its limited hardware resources and power budget. Although the previous mobile neural processing units (NPUs) take advantages of low-bit computing and exploitation of the sparsity, it is incapable of accelerating high-precision and dense DNNs. This paper proposes energy-efficient signed bit-slice architecture which accelerates both high-precision and dense DNNs by exploiting a large number of zero values of signed bit-slices. Proposed signed bit-slice representation (SBR) changes signed 1111₂ bit-slice to 0000₂ by borrowing a 1 value from its lower order of bit-slice. As a result, it generates a large number of zero bit-slices even in dense DNNs. Moreover, it balances the positive and negative values of 2's complement data, allowing bit-slice based output speculation which pre-computes high order of bit-slices and skips the remaining dense low order of bit-slices. The signed bit-slice architecture compresses and skips the zero input signed bit-slices, and its zero skipping unit also supports the output skipping by masking the speculated inputs as zero. Additionally, the heterogeneous network-on-chip (NoC) benefits exploitation of data reusability and reduction of transmission bandwidth. The paper introduces a specialized instruction set architecture (ISA) and a hierarchical instruction decoder for the control of the signed bit-slice architecture. Finally, the signed bit-slice architecture outperforms the previous bit-slice accelerator, Bitfusion, over ×3.65 higher area-efficiency, ×3.88 higher energy-efficiency, and ×5.35 higher throughput.
