Benchmarking and Analyzing Deep Neural Network Training

Zhu, Hongyu; Akrout, Mohamed; Zheng, Bojian; Pelegris, Andrew; Jayarajan, Anand; Phanishayee, Amar; Schroeder, Bianca; Pekhimenko, Gennady

doi:10.1109/iiswc.2018.8573476

Cited by 128 publications

(88 citation statements)

References 71 publications

“…DNN performance analysis and optimization. Current publicly available benchmarks [3,8,19,55] for DNNs focus on neural networks with FC, CNN, and RNN layers only.…”

Section: Related Work (mentioning)

confidence: 99%

The Architectural Implications of Facebook's DNN-Based Personalized Recommendation

Gupta, Wang, et al. 2020

2020 IEEE International Symposium on High Performance Computer Architecture (HPCA)

The widespread application of deep learning has changed the landscape of computation in the data center. In particular, personalized recommendation for content ranking is now largely accomplished leveraging deep neural networks. However, despite the importance of these models and the amount of compute cycles they consume, relatively little research attention has been devoted to systems for recommendation. To facilitate research and to advance the understanding of these workloads, this paper presents a set of real-world, production-scale DNNs for personalized recommendation coupled with relevant performance metrics for evaluation. In addition to releasing a set of open-source workloads, we conduct in-depth analysis that underpins future system design and optimization for at-scale recommendation: Inference latency varies by 60% across three Intel server generations, batching and co-location of inferences can drastically improve latency-bounded throughput, and the diverse composition of recommendation models leads to different optimization strategies. Preprint. Under submission.

“…Zhu et al. [15] study the training performance and resource utilization of eight deep learning models implemented on three machine learning frameworks running on servers (not mobile devices) across different hardware configurations. However, they do not consider power and energy efficiency.…”

Section: Related Work (mentioning)

confidence: 99%

Performance Analysis and Characterization of Training Deep Learning Models on Mobile Device

Liu et al. 2019

2019 IEEE 25th International Conference on Parallel and Distributed Systems (ICPADS)

Training deep learning models on mobile devices has recently become possible because of the increasing computation power of mobile hardware and the benefits it brings to user experience. Most existing work on machine learning on mobile devices focuses on inference with deep learning models, not training. The performance characterization of training deep learning models on mobile devices is largely unexplored, although understanding it is critical for designing and implementing deep learning models on mobile devices. In this paper, we perform a variety of experiments on a representative mobile device (the NVIDIA TX2) to study the performance of training deep learning models. We introduce a benchmark suite and a tool to study the performance of training deep learning models on mobile devices from the perspectives of memory consumption, hardware utilization, and power consumption. The tool can correlate performance results with fine-grained operations in deep learning models, providing capabilities to capture performance variance and problems at a fine granularity. We reveal interesting performance problems and opportunities, including under-utilization of heterogeneous hardware, large energy consumption of the memory, and high predictability of workload characterization. Based on the performance analysis, we suggest interesting research directions.

“…However, there is no discussion in the text about potential scalability effects beyond 8 GPUs. An extensive performance analysis and profiling of DNN training is performed in [31], where eight state-of-the-art DNN models are implemented on three major deep learning frameworks (TensorFlow, MXNet, and CNTK). The objective is to evaluate the efficiency of training for different hardware configurations (single-/multi-GPU and multi-machine).…”

Section: Related Work (mentioning)

confidence: 99%

Performance Aware Convolutional Neural Network Channel Pruning for Embedded GPUs

Radu, Kaszyk, Wen, et al. 2019

2019 IEEE International Symposium on Workload Characterization (IISWC)

Convolutional Neural Networks (CNN) are becoming a common presence in many applications and services, due to their superior recognition accuracy. They are increasingly being used on mobile devices, many times just by porting large models designed for server space, although several model compression techniques have been considered. One model compression technique intended to reduce computations is channel pruning. Mobile and embedded systems now have GPUs which are ideal for the parallel computations of neural networks and for their lower energy cost per operation. Specialized libraries perform these neural network computations through highly optimized routines. As we find in our experiments, these libraries are optimized for the most common network shapes, making uninstructed channel pruning inefficient. We evaluate higher level libraries, which analyze the input characteristics of a convolutional layer, based on which they produce optimized OpenCL (Arm Compute Library and TVM) and CUDA (cuDNN) code. However, in reality, these characteristics and subsequent choices intended for optimization can have the opposite effect. We show that a reduction in the number of convolutional channels, pruning 12% of the initial size, is in some cases detrimental to performance, leading to 2× slowdown. On the other hand, we also find examples where performance-aware pruning achieves the intended results, with performance speedups of 3× with cuDNN and above 10× with Arm Compute Library and TVM. Our findings expose the need for hardware-instructed neural network pruning.
