2018
DOI: 10.1145/3243904

Layer-Centric Memory Reuse and Data Migration for Extreme-Scale Deep Learning on Many-Core Architectures

Abstract: Owing to the popularity of Deep Neural Network (DNN) models, we have witnessed extreme-scale DNN models whose depth and width continue to grow. However, their extremely high memory requirements make it difficult to run the training process on a single many-core architecture such as a Graphics Processing Unit (GPU), which compels researchers to resort to model parallelism over multiple GPUs. Model parallelism, in turn, always brings heavy additional overhead. The…
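As context for the "layer-centric memory reuse and data migration" named in the title, the following is a minimal sketch of the general idea of migrating a layer's activations to pinned host memory during the forward pass and prefetching them back before the backward pass. It assumes a PyTorch-style setting rather than the paper's actual system; the tensor shape, stream handling, and helper names (`offload`, `prefetch`) are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch (illustrative, not the paper's implementation): layer-wise data
# migration moves a layer's forward activations to pinned host memory once the GPU
# no longer needs them, then copies them back just before the backward pass.
import torch

def offload(t, stream):
    """Asynchronously copy a GPU tensor into pinned host memory on `stream`."""
    host = torch.empty(t.shape, dtype=t.dtype, pin_memory=True)
    stream.wait_stream(torch.cuda.current_stream())  # wait for the producer of `t`
    with torch.cuda.stream(stream):
        host.copy_(t, non_blocking=True)
    return host

def prefetch(host, stream):
    """Asynchronously copy a pinned host tensor back onto the GPU on `stream`."""
    with torch.cuda.stream(stream):
        return host.to("cuda", non_blocking=True)

if torch.cuda.is_available():
    copy_stream = torch.cuda.Stream()                     # overlaps copies with compute
    act = torch.randn(64, 256, 56, 56, device="cuda")     # a hypothetical layer's activations
    act_host = offload(act, copy_stream)                  # forward: migrate to host
    del act                                               # GPU memory can now be reused
    act_back = prefetch(act_host, copy_stream)            # before backward: migrate back
    torch.cuda.current_stream().wait_stream(copy_stream)  # compute waits for the prefetch
```

Running the copies on a separate CUDA stream is what allows the migration to overlap with the computation of other layers; without that overlap, the transfers would simply add to the iteration time.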

Cited by 30 publications (34 citation statements) · References 26 publications · Citing publications published 2019–2024

Citation statements (ordered by relevance):
“…We do the evaluation in the following three aspects: First, the execution time potential of Layup is evaluated, comparing with some existing state-of-the-art works, including vDNN [32], Layrub [25], Caffe [24], and SuperNeurons [42]. Since SuperNeurons is the most effective approach among them and was put forward recently, we carry out a separate comparison with this technique in more detail.…”
Section: Discussion (mentioning, confidence: 99%)
“…AlexNet is used for this comparison. Both of these memory optimizations are evaluated based on their best performance implementations [4,25] on Caffe. As shown in Figures 2(a) and 2(b), we make the following findings: the CPU-GPU transfer outperforms the extra forward computation in the CONV1-CONV5 and FC6-FC8 layers (by an average speedup of 7.4×), but underperforms for the rest of the layers.…”
Section: Issue 1: Performance Costs of Memory-Optimized Methods (mentioning, confidence: 99%)
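The trade-off described in this excerpt can be made concrete with a small timing sketch. The code below is a hypothetical PyTorch microbenchmark, not the Caffe-based setups evaluated in [4, 25]: it times re-running a convolution's forward pass ("extra forward computation") against round-tripping its output through pinned host memory ("CPU-GPU transfer") for an AlexNet-like layer shape; the batch size and layer dimensions are assumptions.

```python
# Minimal sketch (assumed PyTorch setup, not the cited Caffe implementations):
# compare recomputing a layer's output with transferring it to host memory and back.
import time
import torch

def timed(fn, iters=20):
    """Average wall-clock time of fn(), synchronizing the GPU around the loop."""
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        fn()
    torch.cuda.synchronize()
    return (time.perf_counter() - start) / iters

if torch.cuda.is_available():
    # Roughly AlexNet's CONV2 shape (96 -> 256 channels, 5x5 kernel, 27x27 input).
    conv = torch.nn.Conv2d(96, 256, kernel_size=5, padding=2).cuda()
    x = torch.randn(128, 96, 27, 27, device="cuda")
    with torch.no_grad():
        y = conv(x)
    y_host = torch.empty(y.shape, dtype=y.dtype, pin_memory=True)

    def recompute():                 # "extra forward computation"
        with torch.no_grad():
            return conv(x)

    def transfer():                  # "CPU-GPU transfer" (device -> host -> device)
        y_host.copy_(y)
        return y_host.to("cuda", non_blocking=True)

    print(f"recompute: {timed(recompute) * 1e3:.2f} ms")
    print(f"transfer : {timed(transfer) * 1e3:.2f} ms")
```

Which side wins depends on the layer's arithmetic cost relative to its activation size and on the available PCIe bandwidth, which is consistent with the excerpt's finding that the transfer path outperforms recomputation for CONV1-CONV5 and FC6-FC8 but not for the remaining layers.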