2019 29th International Conference on Field Programmable Logic and Applications (FPL)
DOI: 10.1109/fpl.2019.00036
FPGA-Based Training Accelerator Utilizing Sparseness of Convolutional Neural Network

Cited by 18 publications (11 citation statements)
References 9 publications
“…A sparse CNN training accelerator was designed on VCU1525. The accelerator was implemented on a pre-trained CNN model with 85% parameters pruned [20]. However, these existing works mainly focused on cloud-level devices with abundant computation and memory resources.…”
Section: Related Work
confidence: 99%
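The excerpt above mentions a pre-trained model with 85% of its parameters pruned. A common way to reach a target sparsity is magnitude-based pruning; this is a minimal sketch under that assumption (the excerpt does not name the pruning criterion, and `magnitude_prune` is a hypothetical helper, not code from [20]):

```python
def magnitude_prune(weights, sparsity=0.85):
    """Zero out the smallest-magnitude fraction of weights.

    Illustrative sketch only: the actual criterion used by the
    accelerator in [20] is not specified in the excerpt.
    """
    flat = sorted(abs(w) for w in weights)
    k = int(len(flat) * sparsity)          # number of weights to remove
    threshold = flat[k - 1] if k > 0 else 0.0
    # Keep only weights whose magnitude exceeds the threshold.
    return [0.0 if abs(w) <= threshold else w for w in weights]

pruned = magnitude_prune(
    [0.9, -0.05, 0.4, 0.01, -0.7, 0.02, 0.3, -0.08], sparsity=0.75
)
# 6 of the 8 weights (75%) are now zero; only 0.9 and -0.7 survive.
```

In a hardware context, the resulting zeros are what a sparse training accelerator skips, reducing both multiply-accumulate work and weight storage.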
“…), the on-chip memory of an edge FPGA is not large enough to hold the weights or features of every Conv layer. Therefore, several works [4,18,20] applied quantization or pruning to reduce off-chip memory access. However, unlike inference, where compressed networks cause little accuracy loss [7], these training works have not shown that their compression techniques maintain high accuracy on large datasets with dense networks.…”
Section: And
confidence: 99%
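The quantization referenced in the excerpt above reduces off-chip traffic by storing narrower integers instead of floats. As a hedged illustration (the cited works' exact schemes are not given in the excerpt), here is a symmetric per-tensor 8-bit quantizer; `quantize_int8` is a hypothetical helper:

```python
def quantize_int8(values):
    """Symmetric 8-bit quantization with one float scale per tensor.

    Sketch only; assumes the tensor is nonzero so the scale is valid.
    Storing int8 codes instead of float32 cuts memory traffic by 4x.
    """
    scale = max(abs(v) for v in values) / 127.0
    q = [round(v / scale) for v in values]   # integer codes in [-127, 127]
    deq = [c * scale for c in q]             # values the compute units see
    return q, scale, deq
```

The dequantized values differ from the originals by at most about half a scale step, which is why inference tolerates this well; whether training does, on large datasets, is exactly the open question the excerpt raises.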
“…Sparse Accelerators. Sparse accelerators [84, 129, 168-176] address the inefficiencies caused by the zeros contained in sparse matrices, which is a fundamentally different problem from the padding introduced by transposed and dilated convolutions. EcoFlow can be incorporated into these accelerators to obtain aggregated benefits.…”
Section: Related Work
confidence: 99%
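The zero-skipping that sparse accelerators exploit, per the excerpt above, comes down to storing only nonzeros in a compressed format. A minimal software analogue is CSR (compressed sparse row); real accelerators use hardware-specific formats, so this is only a sketch of the idea:

```python
def to_csr(dense):
    """Compress a dense matrix into CSR triplets (values, col_idx, row_ptr)."""
    values, col_idx, row_ptr = [], [], [0]
    for row in dense:
        for j, v in enumerate(row):
            if v != 0:                 # zeros are simply never stored
                values.append(v)
                col_idx.append(j)
        row_ptr.append(len(values))    # end offset of this row's nonzeros
    return values, col_idx, row_ptr

def csr_matvec(values, col_idx, row_ptr, x):
    """Multiply a CSR matrix by vector x, touching only the nonzeros."""
    y = []
    for r in range(len(row_ptr) - 1):
        y.append(sum(values[k] * x[col_idx[k]]
                     for k in range(row_ptr[r], row_ptr[r + 1])))
    return y
```

With 85% sparsity, such a loop performs only 15% of the dense multiply-accumulates, which is the source of the speedups these accelerators target.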
“…To aid the development of deep learning models on FPGAs, (Venieris and Bouganis, 2016) propose a framework for mapping CNNs onto FPGAs. Furthermore, the authors (Ma et al, 2019; Nakahara et al, 2019) propose FPGA-based accelerators that leverage sources of parallelism to achieve an efficient implementation of a deep convolutional neural network. Finally, (Zhu et al, 2020) presents a reconfigurable framework for training CNNs.…”
Section: Related Work
confidence: 99%