2018 28th International Conference on Field Programmable Logic and Applications (FPL)
DOI: 10.1109/fpl.2018.00074

A Framework for Acceleration of CNN Training on Deeply-Pipelined FPGA Clusters with Work and Weight Load Balancing

Abstract: Deep Neural Networks (DNNs) have revolutionized numerous applications, but the demand for ever more performance remains unabated. Scaling DNN computations to larger clusters is generally done by distributing tasks in batch mode using methods such as distributed synchronous SGD. Among the issues with this approach is that to make the distributed cluster work with high utilization, the workload distributed to each node must be large, which implies nontrivial growth in the SGD mini-batch size. In this paper, we p…
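The scaling issue raised in the abstract can be made concrete with a minimal sketch (not from the paper itself): under distributed synchronous SGD, per-node gradients are averaged, so the effective mini-batch grows linearly with the number of workers. The batch and node counts below are illustrative assumptions.

```python
# Illustrative sketch: in distributed synchronous SGD, each node computes
# gradients on its local batch and the results are averaged, so the
# effective mini-batch size is the per-node batch times the node count.

def effective_batch_size(per_node_batch: int, num_nodes: int) -> int:
    """Effective SGD mini-batch when gradients are averaged across nodes."""
    return per_node_batch * num_nodes

# Keeping each node busy may require a sizable local batch; e.g. a local
# batch of 64 on a 16-node cluster gives an effective mini-batch of 1024,
# and such growth can hurt convergence.
print(effective_batch_size(64, 16))  # -> 1024
```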

Cited by 47 publications (24 citation statements) · References 29 publications
“…The power dissipation is also reduced with less use of DSP resources. The power efficiency is 3.27× more than the most advanced multi-FPGA structure-based design [8]. We have also compared our result with MobileNet V2 on Intel Arria 10 SoC FPGA [9], which used more BRAM and DSP than our design.…”
Section: Results for ImageNet Classification
confidence: 95%
“…Guo et al [29] proposed a CNN design with a data quantization strategy and compilation tool which could get 137 GOPS throughput on Zynq XC7Z045 FPGA. Geng et al [8] proposed a quantitative model for mapping CNNs on multi-FPGAs to improve the throughput. However, the power consumption will increase greatly by using an FPGA cluster.…”
Section: Background, A. Related Work
confidence: 99%
“…To fully utilize the computation power provided by multiple FPGAs, a typical technique is to implement the neural network on multiple FPGAs in a pipelined fashion [15], [20], [22], [23]. Figure 2 demonstrates one such example, in which a 5-layer network is partitioned into 3 pipeline stages, and each pipeline stage is mapped to a certain FPGA in an available pool.…”
Section: B. Implementing DNNs on FPGAs
confidence: 99%
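The pipelined mapping described in the excerpt above (a 5-layer network split into 3 pipeline stages, each stage on one FPGA) can be sketched as a load-balancing problem: choose contiguous cut points so the heaviest stage, which bounds pipeline throughput, is as light as possible. The layer costs below are made-up operation counts, and `partition_layers` is a hypothetical helper, not the paper's algorithm.

```python
# Hypothetical sketch of mapping a layered network onto a pipelined FPGA
# pool: exhaustively try contiguous cut points and keep the split whose
# maximum (bottleneck) stage cost is smallest.
from itertools import combinations

def partition_layers(costs, num_stages):
    """Return contiguous layer groups minimizing the heaviest stage."""
    n = len(costs)
    best, best_bounds = float("inf"), None
    for cuts in combinations(range(1, n), num_stages - 1):
        bounds = [0, *cuts, n]
        stage_costs = [sum(costs[a:b]) for a, b in zip(bounds, bounds[1:])]
        if max(stage_costs) < best:
            best, best_bounds = max(stage_costs), bounds
    return [tuple(range(a, b)) for a, b in zip(best_bounds, best_bounds[1:])]

# 5 layers with illustrative costs -> 3 stages, one per FPGA.
print(partition_layers([4, 2, 6, 1, 3], 3))  # -> [(0, 1), (2,), (3, 4)]
```

Exhaustive search is fine at this scale; for deep networks, the same objective is typically handled with dynamic programming.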
“…In the early stage, research efforts are mainly focusing on designing accelerators on a single FPGA [28]- [31]. Most recently, implementations on multiple FPGAs has become the mainstream [15], [18]- [20], [22], [23], since limited resource on a single FPGA becomes the performance bottleneck.…”
Section: Partition (P)
confidence: 99%
“…Note that we do not claim that this is the optimal convolution accelerator implementation. Finding the best design instance often require extensive exploration of a large design space involving loop transformations and data layout optimization [15], and this problem is completely orthogonal to our approach. One important notion in their work that we also use is the tiles.…”
Section: A Baseline Accelerator Design
confidence: 99%
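The "tiles" notion mentioned in the last excerpt can be illustrated with a minimal sketch: the accelerator's loops are blocked so that only one tile of weights and inputs must reside in on-chip buffers at a time. A matrix-vector product stands in for the convolution here, and the tile sizes `Tm`, `Tn` are illustrative assumptions, not values from the paper.

```python
# Illustrative loop tiling: compute y = W*x in (Tm x Tn) blocks so each
# weight tile and input slice fits in an on-chip buffer; partial sums for
# Tm outputs are accumulated across input tiles.

def tiled_matvec(W, x, Tm=2, Tn=3):
    """Blocked y = W*x: Tm output rows and Tn input features per tile."""
    M, N = len(W), len(x)
    y = [0.0] * M
    for m0 in range(0, M, Tm):          # tile over output neurons
        for n0 in range(0, N, Tn):      # tile over input features
            # only this W tile and x slice need to be on-chip at once
            for m in range(m0, min(m0 + Tm, M)):
                for n in range(n0, min(n0 + Tn, N)):
                    y[m] += W[m][n] * x[n]
    return y

W = [[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]]
x = [1.0, 2.0, 3.0]
print(tiled_matvec(W, x))  # matches the untiled product [14.0, 32.0, 50.0, 68.0]
```

Choosing the tile sizes is exactly the design-space exploration the excerpt defers to prior work [15].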