FPDeep: Acceleration and Load Balancing of CNN Training on FPGA Clusters

Geng, Tong; Wang, Tianqi; Sanaullah, Ahmed; Yang, Cheng‐Hong; Xu, Rui; Patel, Rushi; Herbordt, Martin C.

doi:10.1109/fccm.2018.00021

Cited by 72 publications

(30 citation statements)

References 2 publications


“…Although the variety of works of CNNs on FPGA is very higher, only a few papers exploit a system with multiple FPGAs. This is the case of [8], [9] where a deeply pipelined multi-FPGA architecture is used both for training and inference of CNNs. However, deeply pipelined multi-FPGA architecture fits only a specific class of algorithms within the distributed scenarios and authors described a custom communication infrastructure to deal with distributed nodes communication, instead of trying to generalize the technique.…”

Section: Related Work (mentioning)

confidence: 99%

Hardware resources analysis of BNNs splitting for FARD-based multi-FPGAs Distributed Systems

Fiscaletti, Speziali, Stornaiuolo, et al. 2020

2020 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)


FPGAs have proven to be valid architectures to accelerate the inference phase of Convolutional Neural Networks (CNNs). State-of-the-art works also demonstrated that it is possible to take advantage of a distributed FPGA-based system to improve performance, power consumption and scalability of such algorithms. However, hardware resource usage, communication, and node management become main aspects when dealing with an embedded distributed scenario. In this context, FINN optimizes FPGA-based CNNs with binarization, and FARD is a framework that allows the acceleration of fog computing-based applications with FPGAs. In this work, we present how to extend FARD to deal with job-based applications rather than the event-based fog computing scenario. In particular, we analyzed two PYNQ-Z1 boards connected to each other and we implemented a distributed BNN algorithm based on FINN's CnvW2A2. Results show how hardware resources vary according to the division of the network when splitting after each convolutional layer.


“…Most recently, with the growing demand in time performance, it is a trend to employ a cluster of FPGAs to execute DNNs [15,[26][27][28][29][30][31][32]. In [15,28], authors construct multiple FPGAs as a pipeline to execute a set of input images in a pipeline fashion.…”

Section: Related Work (mentioning)

confidence: 99%

“…In [26], authors split the CNN layers to balance pipeline stages for higher throughput and lower cost. Authors in [27] employ multiple FPGAs for the training phase. In [29,30], multi-FPGA platforms are utilized to accelerate the lung nodule segmentation.…”

Section: Related Work (mentioning)

confidence: 99%

Achieving Super-Linear Speedup across Multi-FPGA for Real-Time DNN Inference

Jiang, Sha, Zhang, et al. 2019

ACM Trans. Embed. Comput. Syst.


Real-time Deep Neural Network (DNN) inference with low-latency requirement has become increasingly important for numerous applications in both cloud computing (e.g., Apple's Siri) and edge computing (e.g., Google/Waymo's driverless car). FPGA-based DNN accelerators have demonstrated both superior flexibility and performance; in addition, for real-time inference with low batch size, FPGA is expected to achieve further performance improvement. However, the performance gain from the single-FPGA design is obstructed by the limited on-chip resource. In this paper, we employ multiple FPGAs to cooperatively run DNNs with the objective of achieving super-linear speed-up against single-FPGA design. In implementing such systems, we found two barriers that hinder us from achieving the design goal: (1) the lack of a clear partition scheme for each DNN layer to fully exploit parallelism, and (2) the insufficient bandwidth between the off-chip memory and the accelerator due to the growing size of DNNs. To tackle these issues, we propose a general framework, "Super-LIP", which can support different kinds of DNNs. In this paper, we take Convolutional Neural Network (CNN) as a vehicle to illustrate Super-LIP. We first formulate an accurate system-level model to support the exploration of best partition schemes. Then, we develop a novel design methodology to effectively alleviate the heavy loads on memory bandwidth by moving traffic from memory bus to inter-FPGA links. We implement Super-LIP based on ZCU102 FPGA boards. Results demonstrate that Super-LIP with 2 FPGAs can achieve 3.48× speedup, compared to the state-of-the-art single-FPGA design. What is more, as the number of FPGAs scales up, the system latency can be further reduced while maintaining high energy efficiency.


“…Our co-exploration concept and the general framework, however, can also be easily extended to other hardware platforms such as ASICs. Since timing performance on a single FPGA is limited by its restricted resource, it is prevalent to organize multiple FPGAs in a pipelined fashion [20]- [23] to provide high throughput (frame per second, FPS). In such a system, the pipeline efficiency is one of the most important metrics needing to be maximized, since it determines the hardware utilization as well as energy efficiency.…”

Section: Introduction (mentioning)

confidence: 99%

Hardware/Software Co-Exploration of Neural Architectures

Jiang, Yang, Sha, et al. 2020

IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst.



We propose a novel hardware and software coexploration framework for efficient neural architecture search (NAS). Different from existing hardware-aware NAS which assumes a fixed hardware design and explores the neural architecture search space only, our framework simultaneously explores both the architecture search space and the hardware design space to identify the best neural architecture and hardware pairs that maximize both test accuracy and hardware efficiency. Such a practice greatly opens up the design freedom and pushes forward the Pareto frontier between hardware efficiency and test accuracy for better design tradeoffs. The framework iteratively performs a two-level (fast and slow) exploration. Without lengthy training, the fast exploration can effectively fine-tune hyperparameters and prune inferior architectures in terms of hardware specifications, which significantly accelerates the NAS process. Then, the slow exploration trains candidates on a validation set and updates a controller using the reinforcement learning to maximize the expected accuracy together with the hardware efficiency. Experiments on ImageNet show that our co-exploration NAS can find the neural architectures and associated hardware design with the same accuracy, 35.24% higher throughput, 54.05% higher energy efficiency and 136× reduced search time, compared with the state-of-the-art hardware-aware NAS.
