Entropy-Aware I/O Pipelining for Large-Scale Deep Learning on HPC Systems

Zhu, Yuemin; Chowdhury, Fahim; Fu, Huansong; Moody, Adam; Mohror, Kathryn; Sato, Kento; Yu, Weikuan

doi:10.1109/mascots.2018.00023

Cited by 57 publications

(40 citation statements)

References 15 publications

Supporting

Mentioning

Contrasting

Order By: Relevance

“…While reading a huge dataset, read requests to physical backend devices may frequently happen, since the dataset cannot fit entirely in the PFS's cache. These frequent I/O requests to read all the data from a large dataset at each epoch lead to relatively slower I/O performance than that for smaller datasets [24,53].…”

Section: Dataset Sizementioning

confidence: 99%

“…Although some frameworks (e.g., TensorFlow) support local shuffling after sequentially reading a few elements from a batched file, randomly reading small raw images is a general practice to ensure the randomization of an input sequence. These massive small random reads impose non-trivial performance loss compared to sequential reads of large batched files [24,53].…”

Section: Random Filementioning

confidence: 99%

See 1 more Smart Citation

I/O Characterization and Performance Evaluation of BeeGFS for Deep Learning

Chowdhury

Zhu

Heer

et al. 2019

Proceedings of the 48th International Conference on Parallel Processing

Self Cite

View full text Add to dashboard Cite

Parallel File Systems (PFSs) are frequently deployed on leadership High Performance Computing (HPC) systems to ensure efficient I/O, persistent storage and scalable performance. Emerging Deep Learning (DL) applications incur new I/O and storage requirements to HPC systems with batched input of small random files. This mandates PFSs to have commensurate features that can meet the needs of DL applications. BeeGFS is a recently emerging PFS that has grabbed the attention of the research and industry world because of its performance, scalability and ease of use. While emphasizing a systematic performance analysis of BeeGFS, in this paper, we present the architectural and system features of BeeGFS, and perform an experimental evaluation using cutting-edge I/O, Metadata and DL application benchmarks. Particularly, we have utilized AlexNet and ResNet-50 models for the classification of ImageNet dataset using the Livermore Big Artificial Neural Network Toolkit (LBANN), and ImageNet data reader pipeline atop TensorFlow and Horovod. Through extensive performance characterization of BeeGFS, our study provides a useful documentation on how to leverage BeeGFS for the emerging DL applications.

show abstract

Section: Dataset Sizementioning

confidence: 99%

Section: Random Filementioning

confidence: 99%

I/O Characterization and Performance Evaluation of BeeGFS for Deep Learning

Chowdhury

Zhu

Heer

et al. 2019

Proceedings of the 48th International Conference on Parallel Processing

Self Cite

View full text Add to dashboard Cite

show abstract

“…Regarding the issue of I/O and storage for deep learning, both the HPC and deep learning communities have, so far, dedicated most efforts to access large training datasets efficiently [10], [11], [12], [13], while leaving the problem of optimized checkpointing of learning models largely ignored. TensorFlow checkpoints model to files in its SavedModel format, 1 or in HDF5 files through Keras.…”

Section: Related Workmentioning

confidence: 99%

DeepFreeze: Towards Scalable Asynchronous Checkpointing of Deep Learning Models

Nicolae

Wozniak

et al. 2020

2020 20th IEEE/ACM International Symposium on Cluster, Cloud and Internet Computing (CCGRID)

View full text Add to dashboard Cite

In the age of big data, deep learning has emerged as a powerful tool to extract insight and exploit its value, both in industry and scientific applications. One common pattern emerging in such applications is frequent checkpointing of the state of the learning model during training, needed in a variety of scenarios: analysis of intermediate states to explain features and correlations with training data, exploration strategies involving alternative models that share a common ancestor, knowledge transfer, resilience, etc. However, with increasing size of the learning models and popularity of distributed data-parallel training approaches, simple checkpointing techniques used so far face several limitations: low serialization performance, blocking I/O, stragglers due to the fact that only a single process is involved in checkpointing. This paper proposes a checkpointing technique specifically designed to address the aforementioned limitations, introducing efficient asynchronous techniques to hide the overhead of serialization and I/O, and distribute the load over all participating processes. Experiments with two deep learning applications (CANDLE and ResNet) on a pre-Exascale HPC platform (Theta) shows significant improvement over state-of-art, both in terms of checkpointing duration and runtime overhead. Index Terms-checkpointing; deep learning; fine-grain asynchronous I/O; multi-level data persistence

show abstract

“…DNN model checkpointing: The problem of checkpointing DNN models efficiently is beginning to emerge in deep learning, where most efforts so far focus on efficient access of training batches [28], [29], [30], [31]. TensorFlow checkpoints model to files in its SavedModel format, 1 or in HDF5 files through Keras.…”

Section: Background and Problem Formulationmentioning

confidence: 99%

DeepClone: Lightweight State Replication of Deep Learning Models for Data Parallel Training

Nicolae

Wozniak

Dorier

et al. 2020

2020 IEEE International Conference on Cluster Computing (CLUSTER)

View full text Add to dashboard Cite

Training modern deep neural network (DNN) models involves complex workflows triggered by model exploration, sensitivity analysis, explainability, etc. A key primitive in this context is the ability to clone a model training instance, i.e. "fork" the training process in a potentially different direction, which enables comparisons of different evolution paths using variations of training data and model parameters. However, in a quest improve the training throughput, a mix of data parallel, model parallel, pipeline parallel and layer-wise parallel approaches are making the problem of cloning highly complex. In this paper, we explore the problem of efficient cloning under such circumstances. To this end, we leverage several properties of dataparallel training and layer-wise parallelism to design DeepClone, a cloning approach based on augmenting the execution graph to gain direct access to tensors, which are then sharded and reconstructed asynchronously in order to minimize runtime overhead, standby duration, readiness duration. Compared with state-of-art approaches, DeepClone shows orders of magnitude improvement for several classes of DNN models.

show abstract

Entropy-Aware I/O Pipelining for Large-Scale Deep Learning on HPC Systems

Cited by 57 publications

References 15 publications

I/O Characterization and Performance Evaluation of BeeGFS for Deep Learning

I/O Characterization and Performance Evaluation of BeeGFS for Deep Learning

DeepFreeze: Towards Scalable Asynchronous Checkpointing of Deep Learning Models

DeepClone: Lightweight State Replication of Deep Learning Models for Data Parallel Training

Contact Info

Product

Resources

About