Parallel I/O Optimizations for Scalable Deep Learning

Pumma, Sarunya; Si, Min; Feng, Wu-chun; Balaji, Pavan

doi:10.1109/icpads.2017.00097

Cited by 22 publications

(8 citation statements)

References 7 publications

Supporting

Mentioning

Contrasting

Order By: Relevance

“…For example, recent studies [27,28,30,31,35] have documented the results on the evaluation of different DL applications on different HPC systems, but all of these mainly deal with the computation characterization. There have been some mentionable efforts [23,24,41,42,53,54] on I/O profiling and optimization for DL training workloads. Among them, Zhu et al [53] have used BeeGFS as one of the baseline PFS for the comparison with the DeepIO implementation.…”

Section: Related Workmentioning

confidence: 99%

I/O Characterization and Performance Evaluation of BeeGFS for Deep Learning

Chowdhury

Zhu

Heer

et al. 2019

Proceedings of the 48th International Conference on Parallel Processing

View full text Add to dashboard Cite

Parallel File Systems (PFSs) are frequently deployed on leadership High Performance Computing (HPC) systems to ensure efficient I/O, persistent storage and scalable performance. Emerging Deep Learning (DL) applications incur new I/O and storage requirements to HPC systems with batched input of small random files. This mandates PFSs to have commensurate features that can meet the needs of DL applications. BeeGFS is a recently emerging PFS that has grabbed the attention of the research and industry world because of its performance, scalability and ease of use. While emphasizing a systematic performance analysis of BeeGFS, in this paper, we present the architectural and system features of BeeGFS, and perform an experimental evaluation using cutting-edge I/O, Metadata and DL application benchmarks. Particularly, we have utilized AlexNet and ResNet-50 models for the classification of ImageNet dataset using the Livermore Big Artificial Neural Network Toolkit (LBANN), and ImageNet data reader pipeline atop TensorFlow and Horovod. Through extensive performance characterization of BeeGFS, our study provides a useful documentation on how to leverage BeeGFS for the emerging DL applications.

show abstract

Section: Related Workmentioning

confidence: 99%

I/O Characterization and Performance Evaluation of BeeGFS for Deep Learning

Chowdhury

Zhu

Heer

et al. 2019

Proceedings of the 48th International Conference on Parallel Processing

View full text Add to dashboard Cite

show abstract

“…Regarding the issue of I/O and storage for deep learning, both the HPC and deep learning communities have, so far, dedicated most efforts to access large training datasets efficiently [10], [11], [12], [13], while leaving the problem of optimized checkpointing of learning models largely ignored. TensorFlow checkpoints model to files in its SavedModel format, 1 or in HDF5 files through Keras.…”

Section: Related Workmentioning

confidence: 99%

DeepFreeze: Towards Scalable Asynchronous Checkpointing of Deep Learning Models

Nicolae

Wozniak

et al. 2020

2020 20th IEEE/ACM International Symposium on Cluster, Cloud and Internet Computing (CCGRID)

View full text Add to dashboard Cite

In the age of big data, deep learning has emerged as a powerful tool to extract insight and exploit its value, both in industry and scientific applications. One common pattern emerging in such applications is frequent checkpointing of the state of the learning model during training, needed in a variety of scenarios: analysis of intermediate states to explain features and correlations with training data, exploration strategies involving alternative models that share a common ancestor, knowledge transfer, resilience, etc. However, with increasing size of the learning models and popularity of distributed data-parallel training approaches, simple checkpointing techniques used so far face several limitations: low serialization performance, blocking I/O, stragglers due to the fact that only a single process is involved in checkpointing. This paper proposes a checkpointing technique specifically designed to address the aforementioned limitations, introducing efficient asynchronous techniques to hide the overhead of serialization and I/O, and distribute the load over all participating processes. Experiments with two deep learning applications (CANDLE and ResNet) on a pre-Exascale HPC platform (Theta) shows significant improvement over state-of-art, both in terms of checkpointing duration and runtime overhead. Index Terms-checkpointing; deep learning; fine-grain asynchronous I/O; multi-level data persistence

show abstract

“…DNN model checkpointing: The problem of checkpointing DNN models efficiently is beginning to emerge in deep learning, where most efforts so far focus on efficient access of training batches [28], [29], [30], [31]. TensorFlow checkpoints model to files in its SavedModel format, 1 or in HDF5 files through Keras.…”

Section: Background and Problem Formulationmentioning

confidence: 99%

DeepClone: Lightweight State Replication of Deep Learning Models for Data Parallel Training

Nicolae

Wozniak

Dorier

et al. 2020

2020 IEEE International Conference on Cluster Computing (CLUSTER)

View full text Add to dashboard Cite

Training modern deep neural network (DNN) models involves complex workflows triggered by model exploration, sensitivity analysis, explainability, etc. A key primitive in this context is the ability to clone a model training instance, i.e. "fork" the training process in a potentially different direction, which enables comparisons of different evolution paths using variations of training data and model parameters. However, in a quest improve the training throughput, a mix of data parallel, model parallel, pipeline parallel and layer-wise parallel approaches are making the problem of cloning highly complex. In this paper, we explore the problem of efficient cloning under such circumstances. To this end, we leverage several properties of dataparallel training and layer-wise parallelism to design DeepClone, a cloning approach based on augmenting the execution graph to gain direct access to tensors, which are then sharded and reconstructed asynchronously in order to minimize runtime overhead, standby duration, readiness duration. Compared with state-of-art approaches, DeepClone shows orders of magnitude improvement for several classes of DNN models.

show abstract

Parallel I/O Optimizations for Scalable Deep Learning

Cited by 22 publications

References 7 publications

I/O Characterization and Performance Evaluation of BeeGFS for Deep Learning

I/O Characterization and Performance Evaluation of BeeGFS for Deep Learning

DeepFreeze: Towards Scalable Asynchronous Checkpointing of Deep Learning Models

DeepClone: Lightweight State Replication of Deep Learning Models for Data Parallel Training

Contact Info

Product

Resources

About