2019 IEEE International Symposium on Workload Characterization (IISWC)
DOI: 10.1109/iiswc47752.2019.9042047

Characterizing Deep Learning Training Workloads on Alibaba-PAI

Abstract: Modern deep learning models have been exploited in various domains, including computer vision (CV), natural language processing (NLP), search and recommendation. In practical AI clusters, workloads training these models are run using software frameworks such as TensorFlow, Caffe, PyTorch and CNTK. One critical issue for efficiently operating practical AI clouds is to characterize the computing and data transfer demands of these workloads, and more importantly, the training performance given the underlying sof…

Cited by 34 publications (21 citation statements). References 37 publications.

Citation statements:
“…[Appendix 9.1, PipeDream-2BW Convergence] PipeDream-2BW boosts pipeline training throughput by sacrificing sync SGD semantics, which can lead to sub-optimal accuracy results. Although the authors show convergence results for BERT, asynchronous training is not a common practice today, primarily since it is unclear how final accuracy will be affected due to stale backward updates [39].…”
Section: Discussion (mentioning)
Confidence: 98%
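The concern in the statement above is that asynchronous pipelining applies backward updates computed on weights that are several steps old. Below is a minimal sketch of that effect on a toy quadratic objective (my assumption; this is not PipeDream-2BW's actual pipeline schedule, and the function and constants are purely illustrative):

def run(staleness, steps=50, lr=0.3):
    # Gradient descent on f(w) = 0.5 * w**2, whose optimum is w = 0.
    # With staleness s, each update uses the gradient evaluated at the
    # weights from s steps ago, mimicking a stale backward pass.
    w = 5.0
    history = [w]
    for t in range(steps):
        stale_w = history[max(0, t - staleness)]  # weights from `staleness` steps back
        w = w - lr * stale_w                      # grad of 0.5*w**2 at stale_w is stale_w
        history.append(w)
    return abs(w)

print("synchronous SGD (staleness 0):", run(0))
print("stale updates   (staleness 4):", run(4))

With staleness 0 the iterate contracts geometrically toward the optimum; with staleness 4 at the same learning rate it oscillates and converges far more slowly, and a larger delay or learning rate can make it diverge outright, which is the intuition behind the convergence concern in the quoted text.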
“…Xu et al. [15] leverage virtualized GPU metrics and vCPU in isolation to propose an approach to predict slowdown from co-located DL workloads. Wang et al. [19] obtain DL workload and infrastructure features to determine a suitable training regime. Antman [10] also leverages GPU utilization to first identify jobs that may be suitable for co-location.…”
Section: Related Work (mentioning)
Confidence: 99%
“…For consumers, this allows greater insight into potential GPU costs. Understanding and exploiting DL workload utilization to improve co-location is critical for designing resource-efficient DL systems [19], [20], [10]. However, established approaches for characterizing GPU utilization from DL workloads leverage online profiling during execution.…”
Section: Introduction (mentioning)
Confidence: 99%
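The "online profiling during execution" mentioned above is typically a sampling loop over the GPU driver's utilization counters while the training job runs. A minimal sketch, assuming an NVIDIA GPU and the pynvml (NVML) Python bindings; this is a generic profiler, not the specific tooling used in the cited works:

import time
import pynvml

def sample_gpu_utilization(duration_s=10.0, interval_s=0.5, device_index=0):
    """Poll SM/memory utilization and memory footprint at a fixed interval."""
    pynvml.nvmlInit()
    handle = pynvml.nvmlDeviceGetHandleByIndex(device_index)
    samples = []
    deadline = time.time() + duration_s
    try:
        while time.time() < deadline:
            util = pynvml.nvmlDeviceGetUtilizationRates(handle)  # .gpu / .memory are percentages
            mem = pynvml.nvmlDeviceGetMemoryInfo(handle)         # .used / .total in bytes
            samples.append((time.time(), util.gpu, util.memory, mem.used))
            time.sleep(interval_s)
    finally:
        pynvml.nvmlShutdown()
    return samples

if __name__ == "__main__":
    for ts, sm_util, mem_util, mem_used in sample_gpu_utilization(duration_s=5.0):
        print(f"{ts:.1f}s  SM {sm_util}%  mem-bw {mem_util}%  used {mem_used / 2**20:.0f} MiB")

Traces like these are what co-location and characterization studies aggregate per job; the trade-off noted in the quoted text is that they require running and instrumenting the workload rather than predicting its utilization ahead of time.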
“…On the one hand, due to cost, power, and space constraints, edge boxes typically possess weaker GPUs than their cloud counterparts [4, 27, 83]. On the other hand, analytics deployments face rapidly increasing workloads due to the following trends: (1) more camera feeds to analyze [27, 47, 49], (2) more models to run due to increased popularity and shifts to bring-your-own-model platforms [16, 23, 38, 48], and (3) increased model complexity, primarily through growing numbers of layers and parameters (Figure 1) [15, 50, 51, 92]. Taken together, the result is an ever-worsening resource picture for edge video analytics.…”
Section: Introduction (mentioning)
Confidence: 99%