2018 28th International Conference on Field Programmable Logic and Applications (FPL)
DOI: 10.1109/fpl.2018.00074

A Framework for Acceleration of CNN Training on Deeply-Pipelined FPGA Clusters with Work and Weight Load Balancing

Abstract: Deep Neural Networks (DNNs) have revolutionized numerous applications, but the demand for ever more performance remains unabated. Scaling DNN computations to larger clusters is generally done by distributing tasks in batch mode using methods such as distributed synchronous SGD. Among the issues with this approach is that to make the distributed cluster work with high utilization, the workload distributed to each node must be large, which implies nontrivial growth in the SGD mini-batch size. In this paper, we p…
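The scaling issue raised in the abstract can be made concrete with a minimal sketch (not from the paper itself): under distributed synchronous SGD, per-node gradients are averaged, so the effective mini-batch grows linearly with the number of workers. The batch and node counts below are illustrative assumptions.

```python
# Illustrative sketch: in distributed synchronous SGD, each node computes
# gradients on its local batch and the results are averaged, so the
# effective mini-batch size is the per-node batch times the node count.

def effective_batch_size(per_node_batch: int, num_nodes: int) -> int:
    """Effective SGD mini-batch when gradients are averaged across nodes."""
    return per_node_batch * num_nodes

# Keeping each node busy may require a sizable local batch; e.g. a local
# batch of 64 on a 16-node cluster gives an effective mini-batch of 1024,
# and such growth can hurt convergence.
print(effective_batch_size(64, 16))  # -> 1024
```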

Cited by 47 publications (24 citation statements) · References 29 publications
“…The power dissipation is also reduced with less use of DSP resources. The power efficiency is 3.27× more than the most advanced multi-FPGA structure-based design [8]. We have also compared our result with MobileNet V2 on Intel Arria 10 SoC FPGA [9], which used more BRAM and DSP than our design.…”
Section: Results for ImageNet Classification
confidence: 95%
“…Guo et al [29] proposed a CNN design with a data quantization strategy and compilation tool which could get 137 GOPS throughput on Zynq XC7Z045 FPGA. Geng et al [8] proposed a quantitative model for mapping CNNs on multi-FPGAs to improve the throughput. However, the power consumption will increase greatly by using an FPGA cluster.…”
Section: Background, A. Related Work
confidence: 99%
“…To fully utilize the computation power provided by multiple FPGAs, a typical technique is to implement the neural network on multiple FPGAs in a pipelined fashion [15], [20], [22], [23]. Figure 2 demonstrates one such example, in which a 5-layer network is partitioned into 3 pipeline stages, and each pipeline stage is mapped to a certain FPGA in an available pool.…”
Section: B. Implementing DNNs on FPGAs
confidence: 99%
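The pipelined mapping described in the excerpt above (a 5-layer network split into 3 pipeline stages, each stage on one FPGA) can be sketched as a load-balancing problem: choose contiguous cut points so the heaviest stage, which bounds pipeline throughput, is as light as possible. The layer costs below are made-up operation counts, and `partition_layers` is a hypothetical helper, not the paper's algorithm.

```python
# Hypothetical sketch of mapping a layered network onto a pipelined FPGA
# pool: exhaustively try contiguous cut points and keep the split whose
# maximum (bottleneck) stage cost is smallest.
from itertools import combinations

def partition_layers(costs, num_stages):
    """Return contiguous layer groups minimizing the heaviest stage."""
    n = len(costs)
    best, best_bounds = float("inf"), None
    for cuts in combinations(range(1, n), num_stages - 1):
        bounds = [0, *cuts, n]
        stage_costs = [sum(costs[a:b]) for a, b in zip(bounds, bounds[1:])]
        if max(stage_costs) < best:
            best, best_bounds = max(stage_costs), bounds
    return [tuple(range(a, b)) for a, b in zip(best_bounds, best_bounds[1:])]

# 5 layers with illustrative costs -> 3 stages, one per FPGA.
print(partition_layers([4, 2, 6, 1, 3], 3))  # -> [(0, 1), (2,), (3, 4)]
```

Exhaustive search is fine at this scale; for deep networks, the same objective is typically handled with dynamic programming.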
“…In the early stage, research efforts are mainly focusing on designing accelerators on a single FPGA [28]- [31]. Most recently, implementations on multiple FPGAs has become the mainstream [15], [18]- [20], [22], [23], since limited resource on a single FPGA becomes the performance bottleneck.…”
Section: Partition (P)
confidence: 99%
“…Note that we do not claim that this is the optimal convolution accelerator implementation. Finding the best design instance often require extensive exploration of a large design space involving loop transformations and data layout optimization [15], and this problem is completely orthogonal to our approach. One important notion in their work that we also use is the tiles.…”
Section: A Baseline Accelerator Design
confidence: 99%
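The "tiles" notion mentioned in the last excerpt can be illustrated with a minimal sketch: the accelerator's loops are blocked so that only one tile of weights and inputs must reside in on-chip buffers at a time. A matrix-vector product stands in for the convolution here, and the tile sizes `Tm`, `Tn` are illustrative assumptions, not values from the paper.

```python
# Illustrative loop tiling: compute y = W*x in (Tm x Tn) blocks so each
# weight tile and input slice fits in an on-chip buffer; partial sums for
# Tm outputs are accumulated across input tiles.

def tiled_matvec(W, x, Tm=2, Tn=3):
    """Blocked y = W*x: Tm output rows and Tn input features per tile."""
    M, N = len(W), len(x)
    y = [0.0] * M
    for m0 in range(0, M, Tm):          # tile over output neurons
        for n0 in range(0, N, Tn):      # tile over input features
            # only this W tile and x slice need to be on-chip at once
            for m in range(m0, min(m0 + Tm, M)):
                for n in range(n0, min(n0 + Tn, N)):
                    y[m] += W[m][n] * x[n]
    return y

W = [[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]]
x = [1.0, 2.0, 3.0]
print(tiled_matvec(W, x))  # matches the untiled product [14.0, 32.0, 50.0, 68.0]
```

Choosing the tile sizes is exactly the design-space exploration the excerpt defers to prior work [15].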