2019 IEEE International Parallel and Distributed Processing Symposium (IPDPS)
DOI: 10.1109/ipdps.2019.00029

Runtime Concurrency Control and Operation Scheduling for High Performance Neural Network Training

Abstract: Training a neural network (NN) often uses a machine learning framework such as TensorFlow or Caffe2. These frameworks employ a dataflow model in which NN training is modeled as a directed graph composed of a set of nodes. Operations in NN training are typically implemented by the frameworks as primitives and represented as nodes in the dataflow graph. Training NN models in a dataflow-based machine learning framework involves a large number of fine-grained operations which present diverse memory access patterns…
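To make the dataflow model concrete, below is a minimal sketch (not taken from the paper; the layer sizes and names are illustrative) of a single training step traced into a TensorFlow dataflow graph, where each primitive operation (MatMul, Relu, the gradient and update ops) becomes a node that a runtime scheduler can order or run concurrently.

```python
# Minimal sketch: one training step expressed as a TensorFlow dataflow graph.
# tf.function traces the Python code into a graph whose nodes are fine-grained
# primitive operations, the representation the paper's runtime scheduler targets.
import tensorflow as tf

w1 = tf.Variable(tf.random.normal([784, 256]))   # illustrative layer sizes
w2 = tf.Variable(tf.random.normal([256, 10]))
opt = tf.keras.optimizers.SGD(learning_rate=0.01)

@tf.function  # compiles the step into a dataflow graph of primitive ops
def train_step(x, y):
    with tf.GradientTape() as tape:
        h = tf.nn.relu(tf.matmul(x, w1))          # MatMul + Relu nodes
        logits = tf.matmul(h, w2)                 # another MatMul node
        loss = tf.reduce_mean(
            tf.nn.sparse_softmax_cross_entropy_with_logits(labels=y,
                                                            logits=logits))
    grads = tape.gradient(loss, [w1, w2])         # gradient ops added to the graph
    opt.apply_gradients(zip(grads, [w1, w2]))     # parameter-update ops
    return loss
```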

Cited by 8 publications (10 citation statements) · References 16 publications
“…This indicates that the computation across time steps remains stable and hence is highly predictable. This observation is consistent with the existing work that leverages predictability of deep learning workloads for performance optimization [50], [51].…”
Section: B. Performance Analysis (supporting; confidence: 91%)
“…Such predictability allows us to apply dynamic profiling on a few training steps to collect workload characterization, based on which we guide operation scheduling and power management in the future training steps. Predictability of execution time during the training has been leveraged in the existing work [50], [51]. We expect to leverage the predictability of other characterization in the future work.…”
Section: Discussion and Future Research Directions (mentioning; confidence: 99%)
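The two quotes above describe profiling a few early training steps and reusing the measurements for later steps, relying on the observation that step times are highly predictable. A minimal sketch of that idea, assuming a generic `train_step(x, y)` callable (the function and parameter names are hypothetical, not from the cited work):

```python
# Sketch of dynamic profiling: time the first few training steps and reuse the
# average as a prediction for all future steps, e.g. to guide operation
# scheduling or power management decisions.
import time

def profile_steps(train_step, batches, n_profile=5):
    """Run n_profile steps and return the average wall-clock time per step."""
    times = []
    for i, (x, y) in enumerate(batches):
        if i >= n_profile:
            break
        start = time.perf_counter()
        train_step(x, y)
        times.append(time.perf_counter() - start)
    return sum(times) / len(times)  # predicted time for future steps
```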
“…Despite concentrating on GPUs, Liu et al. [19] proposed a lightweight machine-learning-based performance model to choose the number of threads to use for parallelizing the training of a neural network (NN). They chose to use non-deterministic features collected by hardware counters, namely the number of CPU cycles, the number of cache misses, the number of accesses to the last cache level, and the number of level-1 cache hits.…”
Section: Related Contributions (mentioning; confidence: 99%)
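As an illustration of the approach attributed to Liu et al. [19], the sketch below fits a lightweight model on hardware-counter features to pick a thread count; the counter values and labels are synthetic placeholders, and scikit-learn stands in for whatever model the original work used.

```python
# Sketch of a learned performance model: hardware-counter features from a short
# probe run are mapped to the thread count predicted to perform best.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Each row: [cpu_cycles, cache_misses, llc_accesses, l1_hits] from a probe run;
# label: thread count that performed best for that configuration (synthetic data).
X_train = np.array([[1.2e9, 4.0e6, 9.0e5, 7.5e7],
                    [3.4e9, 2.1e7, 4.8e6, 2.0e8],
                    [8.9e9, 6.3e7, 1.1e7, 5.4e8]])
y_train = np.array([4, 8, 16])  # best-performing thread counts

model = RandomForestClassifier(n_estimators=50, random_state=0)
model.fit(X_train, y_train)

probe = np.array([[2.8e9, 1.8e7, 3.9e6, 1.7e8]])  # counters from a new workload
print("predicted thread count:", model.predict(probe)[0])
```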
“…In this paper, we propose using machine learning to directly predict the optimal chunk-size to achieve the best performance instead of predicting the execution time. Also, we do not attempt to find the optimal number of cores to run an application on, as in [19]. In our research, it is assumed that the user is working on a given number of cores and simply wants to find the optimal way to share the workload between these cores.…”
Section: Related Contributions (mentioning; confidence: 99%)
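A rough sketch of the chunk-size prediction idea described in this quote, with synthetic training data and a simple regressor standing in for the cited paper's actual model; the feature choices here are assumptions, not from the cited work.

```python
# Sketch: learn a mapping from workload features to a chunk size, then use the
# prediction to split loop iterations across a fixed number of cores.
import numpy as np
from sklearn.linear_model import LinearRegression

# Features per workload: [total_iterations, per-iteration cost variance, n_cores]
# Target: chunk size that gave the best measured performance (synthetic data).
X = np.array([[10000, 0.1, 8], [10000, 0.9, 8], [50000, 0.5, 16]])
y = np.array([256, 32, 128])
chunk_model = LinearRegression().fit(X, y)

def schedule(total_iters, cost_variance, n_cores):
    """Partition the iteration space into chunks of the predicted size."""
    chunk = max(1, int(chunk_model.predict([[total_iters, cost_variance, n_cores]])[0]))
    return [(start, min(start + chunk, total_iters))
            for start in range(0, total_iters, chunk)]

print(schedule(10000, 0.5, 8)[:3])  # first few (start, end) chunk ranges
```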