2020 IEEE Symposium on High-Performance Interconnects (HOTI)
DOI: 10.1109/hoti51249.2020.00021
Communication-Efficient Distributed Deep Learning with GPU-FPGA Heterogeneous Computing

Cited by 7 publications (5 citation statements) · References 7 publications
“…Deep learning has proved remarkably effective in many practical applications of artificial intelligence [24], even though training a high-quality model demands substantial computing resources.…”
Section: GPU Cluster
confidence: 99%
“…The researchers who carried out the study [24] concluded that collective communication operations such as AllReduce, which are used to share processing results among graphics processing units (GPUs), create an unavoidable bottleneck. This is consistent with the conclusions of those researchers.…”
Section: Literature Review
confidence: 99%
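The ring-allreduce pattern this excerpt identifies as a bottleneck can be sketched in a single process. This is a minimal simulation of the communication schedule only, assuming an equal-size chunk per worker; the function name and list-based "workers" are illustrative, not from the paper.

```python
def ring_allreduce(grads):
    """Sum-allreduce: grads[w] is worker w's gradient; all end up with the total."""
    p = len(grads)           # number of workers arranged in a ring
    n = len(grads[0])
    chunk = n // p           # assume n divisible by p

    def block(idx):
        return slice(idx * chunk, (idx + 1) * chunk)

    # Reduce-scatter phase: p-1 steps. At step s, worker w sends chunk
    # (w - s) mod p to its right neighbour, which accumulates it. Sends are
    # snapshotted first to mimic all transfers happening simultaneously.
    for s in range(p - 1):
        sends = [(w, (w - s) % p, grads[w][block((w - s) % p)]) for w in range(p)]
        for w, idx, data in sends:
            dst = (w + 1) % p
            b = block(idx)
            for i, v in zip(range(b.start, b.stop), data):
                grads[dst][i] += v
    # Now worker w owns the fully reduced chunk (w + 1) mod p.

    # Allgather phase: p-1 steps circulating the reduced chunks around the ring.
    for s in range(p - 1):
        sends = [(w, (w + 1 - s) % p, grads[w][block((w + 1 - s) % p)]) for w in range(p)]
        for w, idx, data in sends:
            grads[(w + 1) % p][block(idx)] = data
    return grads
```

Each of the 2(p-1) steps moves only n/p elements per worker, which is why the pattern is bandwidth-optimal yet still serializes communication with computation unless overlapped, the bottleneck the cited work targets.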
“…Latency results from the time required to fetch model parameters from off-chip DRAM or external SD cards before computation can be performed on them [150]. Storing the parameters as close as possible to the computation unit, using tiling and data reuse together with hardware-oriented direct memory access (DMA) optimization techniques, would therefore reduce latency and increase computation speed [152]. In addition, because ML models require a high degree of parallelism for efficient performance, throughput is a major issue.…”
Section: Challenges and Optimization Opportunities in Embedded Machine Learning
confidence: 99%
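The tiling-plus-data-reuse idea in this excerpt can be sketched with a tiled matrix multiply: each T×T tile fetched from slow external memory is reused across the inner loops before being evicted. The tile size T and the pure-Python setting are illustrative assumptions, not details from the cited works.

```python
def tiled_matmul(A, B, T=2):
    """C = A @ B computed tile-by-tile; assumes square n x n with n % T == 0."""
    n = len(A)
    C = [[0.0] * n for _ in range(n)]
    for i0 in range(0, n, T):            # tile row of C
        for j0 in range(0, n, T):        # tile column of C
            for k0 in range(0, n, T):    # tile along the reduction axis
                # The T x T tiles of A and B touched here stay resident
                # (in hardware: in an on-chip buffer) while being reused.
                for i in range(i0, i0 + T):
                    for k in range(k0, k0 + T):
                        a = A[i][k]      # loaded once, reused T times below
                        for j in range(j0, j0 + T):
                            C[i][j] += a * B[k][j]
    return C
```

Without tiling, each element of A and B is refetched O(n) times from external memory; with T×T tiles that fit on chip, external traffic drops by roughly a factor of T, which is the latency/throughput gain the excerpt describes.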
“…This architectural organization, if not properly utilized, can result in latency concerns. DMAs are units that transfer data between external memory and the on-chip buffers in the processing logic of the FPGA [152]. Optimizing this process would therefore improve execution speed.…”
Section: Challenges and Optimization Opportunities in Embedded Machine Learning
confidence: 99%
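A standard way to optimize the DMA process this excerpt describes is double buffering ("ping-pong" buffers): while the compute unit drains one on-chip buffer, the DMA engine fills the other. The sketch below only simulates the schedule sequentially; in hardware the marked transfer and compute steps run concurrently. All names are illustrative assumptions.

```python
def process_stream(chunks, compute):
    """Process external-memory chunks through two alternating on-chip buffers."""
    results = []
    if not chunks:
        return results
    buffers = [None, None]                 # the two on-chip ping-pong buffers
    buffers[0] = list(chunks[0])           # DMA: prefetch the first chunk
    for i in range(len(chunks)):
        nxt = (i + 1) % 2
        if i + 1 < len(chunks):
            buffers[nxt] = list(chunks[i + 1])   # DMA fills the idle buffer...
        results.append(compute(buffers[i % 2]))  # ...while compute drains the other
        # In hardware, the two lines above overlap, hiding transfer latency
        # behind computation whenever the compute step is the longer of the two.
    return results
```

Example: `process_stream([[1, 2], [3, 4], [5, 6]], sum)` yields the per-chunk sums `[3, 7, 11]` while, conceptually, each chunk's transfer overlaps the previous chunk's computation.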