Compute-Efficient Neural-Network Acceleration

Wu, Ephrem; Zhang, Xiaoqian; Berman, David B.; Cho, Inkeun; Thendean, John

doi:10.1145/3289602.3293925

Cited by 18 publications

(11 citation statements)

References 10 publications


“…The CORDIC module is used in hyperbolic rotation mode for realizing sinh and cosh functions using Eq. (10). We take an example calculation for sinh (30) and cosh (30) as shown in Table 3.…”

Section: B. Activation Function Computation Using RECON (mentioning)

confidence: 99%


RECON: Resource-Efficient CORDIC-Based Neuron Architecture

Raut, Rai, Vishvakarma, et al. 2021

IEEE Open J. Circuits Syst.


Contemporary hardware implementations of artificial neural networks face the burden of excess area requirements due to resource-intensive elements such as multipliers and non-linear activation functions. The present work addresses this challenge by proposing a resource-efficient Coordinate Rotation Digital Computer (CORDIC)-based neuron architecture (RECON) which can be configured to compute both multiply-accumulate (MAC) and non-linear activation function (AF) operations. The CORDIC-based architecture uses linear and trigonometric relationships to realize MAC and AF operations respectively. The proposed design is synthesized and verified at 45 nm technology using Cadence Virtuoso for all physical parameters. Implementation of the signed fixed-point 8-bit MAC using our design shows a 60% lower area-latency-power product (ALP), with improvements of 38% in area, 27% in power dissipation, and 15% in latency with respect to the state-of-the-art MAC design. Further, Monte-Carlo simulations for process variations and device mismatch are performed for both the proposed model and the state-of-the-art to evaluate expectations of functions of randomness in dynamic power variation. The dynamic power variation for our design shows that the worst-case mean is 189.73 μW, which is 63% of the state-of-the-art.

Index terms: AF, CORDIC, configurable architecture, MAC, neural network.
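The hyperbolic rotation mode named in the citation statement above can be illustrated with a minimal floating-point sketch. This is not the RECON datapath: the function name `cordic_sinh_cosh`, the iteration count, and the use of floats instead of 8-bit fixed-point shifts are all assumptions made for illustration. It uses the textbook hyperbolic CORDIC recurrence, which repeats iterations 4, 13, 40, ... to guarantee convergence and only converges for arguments of magnitude up to roughly 1.118, so a large argument such as 30 would first need range reduction, which is not shown here.

```python
import math

def cordic_sinh_cosh(theta, n_iters=16):
    """Approximate (sinh(theta), cosh(theta)) with hyperbolic-mode CORDIC.

    Illustrative float sketch; valid only for |theta| <= ~1.118.
    """
    # Build the shift schedule 1, 2, 3, 4, 4, 5, ..., repeating 4, 13, 40, ...
    indices = []
    i, repeat = 1, 4
    while len(indices) < n_iters:
        indices.append(i)
        if i == repeat and len(indices) < n_iters:
            indices.append(i)          # repeated step needed for convergence
            repeat = 3 * repeat + 1
        i += 1

    # Each micro-rotation scales the vector by sqrt(1 - 2^(-2i)).
    gain = 1.0
    for i in indices:
        gain *= math.sqrt(1.0 - 2.0 ** (-2 * i))

    # Pre-scale x so the final result needs no gain correction.
    x, y, z = 1.0 / gain, 0.0, theta
    for i in indices:
        d = 1.0 if z >= 0 else -1.0    # rotate toward z == 0
        t = 2.0 ** (-i)                # a right shift by i in fixed-point hardware
        x, y = x + d * y * t, y + d * x * t
        z -= d * math.atanh(t)         # a small lookup table in hardware
    return y, x

s, c = cordic_sinh_cosh(0.5)
print(s, c)
```

Activation functions such as tanh can then be formed from these two outputs (tanh = sinh / cosh), which is one common reason to compute them together in a single unit.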


“…This trade-off gets even more complicated when configurable architectures are required. The FPGAs offer configurable hardware designs but need more chip areas with higher power consumption as compared to ASICs [5], [9], [10].…”

Section: Introduction (mentioning)

confidence: 99%


“…This micro-architecture is optimized for both computation speed and resource utilization. In [24], a pipeline and DRAM-free architecture is proposed to simplify data movement between memory and processing elements. This design achieves very high working frequency as well as DSP utilization.…”

Section: CNN Accelerators (mentioning)

confidence: 99%

SPEC2: SPECtral SParsE CNN Accelerator on FPGAs

Niu, Zeng, Srivastava, et al. 2019

2019 IEEE 26th International Conference on High Performance Computing, Data, and Analytics (HiPC)


To accelerate inference of Convolutional Neural Networks (CNNs), various techniques have been proposed to reduce computation redundancy. Converting convolutional layers into the frequency domain significantly reduces the computation complexity of the sliding window operations in the space domain. On the other hand, weight pruning techniques address the redundancy in model parameters by converting dense convolutional kernels into sparse ones. To obtain high-throughput FPGA implementation, we propose SPEC2, the first work to prune and accelerate spectral CNNs. First, we propose a systematic pruning algorithm based on the Alternating Direction Method of Multipliers (ADMM). The offline pruning iteratively sets the majority of spectral weights to zero, without using any handcrafted heuristics. Then, we design an optimized pipeline architecture on FPGA that has efficient random access into the sparse kernels and exploits various dimensions of parallelism in convolutional layers. Overall, SPEC2 achieves high inference throughput with extremely low computation complexity and negligible accuracy degradation. We demonstrate SPEC2 by pruning and implementing LeNet and VGG16 on the Xilinx Virtex platform. After pruning 75% of the spectral weights, SPEC2 achieves 0% accuracy loss for LeNet, and < 1% accuracy loss for VGG16. The resulting accelerators achieve up to 24× higher throughput, compared with the state-of-the-art FPGA implementations for VGG16.


“…Furthermore, these hard blocks provide specialized nearest-neighbor interconnect for high-bandwidth, low-latency cascade data movement. These features make it particularly attractive for building systolic neural network accelerators such as CLP [42,43], Cascades [40], and Xilinx SuperTile [49,50].…”

Section: Introduction (mentioning)

confidence: 99%

“…The CLP designs presented in [42,43] only operate at 100–170 MHz on Virtex-7 FPGAs but leave DSPs unused. The Xilinx SuperTile [49,50] designs run at 720 MHz, but leave half of the DSPs unused, and also waste URAM bandwidth by limiting access. The chip-spanning 650 MHz 1920×9 systolic array design for the VU11P FPGA [40] requires 95% or more of the hard block resources but fails to route in commercial-grade Xilinx Vivado run with high effort due to congestion.…”

Section: Introduction (mentioning)

confidence: 99%

RapidLayout: Fast Hard Block Placement of FPGA-optimized Systolic Arrays Using Evolutionary Algorithm

Zhang, Chen, Kapre 2022

ACM Trans. Reconfigurable Technol. Syst.


Evolutionary algorithms can outperform conventional placement algorithms such as simulated annealing, analytical placement, and manual placement on runtime, wirelength, pipelining cost, and clock frequency when mapping hard-block-intensive designs such as systolic arrays on Xilinx UltraScale+ FPGAs. For certain hard-block-intensive designs, the commercial-grade Xilinx Vivado CAD tool cannot provide legal routing solutions without tedious manual placement constraints. Instead, we formulate hard block placement as a multi-objective optimization problem that targets wirelength squared and bounding box size. We build an end-to-end placement-and-routing flow called RapidLayout using the Xilinx RapidWright framework. RapidLayout runs 5–6× faster than Vivado with manual constraints and eliminates the weeks-long effort to manually generate placement constraints. RapidLayout outperforms (1) a tuned simulated annealer by 2.7–30.8× in runtime while achieving similar quality of results, (2) VPR by 1.5× in runtime, 1.9–2.4× in wirelength, and 3–4× in bounding box size, while also (3) beating the analytical placer UTPlaceF by 9.3× in runtime, 1.8–2.2× in wirelength, and 2–2.7× in bounding box size.
