2019 IEEE International Conference on Cluster Computing (CLUSTER)
DOI: 10.1109/cluster.2019.8891023
Efficient User-Level Storage Disaggregation for Deep Learning

Cited by 30 publications (14 citation statements)
References 32 publications
“…Quiver [29] implements a distributed cache on cloud VMs' local SSD storage, optimizing for random ordering for hyperparameter tuning jobs. Similarly, DeepIO [52] and DLFS [53] leverage hardware support in the form of RDMA and NVMeOF to provide randomized minibatches from storage. DIESEL [47] co-designs storage and caching to provide efficient randomized minibatches for small files.…”
Section: Related Work (mentioning; confidence: 99%)
“…At Facebook, our scale presents a host of complex and novel challenges for ML data ingestion. While recent work has explored various isolated components of the data ingestion pipeline such as preprocessing [38], filesystems [53], reading [24, 52], or caching [29, 36], there is still relatively little understanding of the end-to-end challenges and requirements in industry-scale environments like ours. This paper presents an in-depth analysis of ML data ingestion requirements at scale, and how we architect and optimize our end-to-end data ingestion pipeline for these requirements.…”
Section: Introduction (mentioning; confidence: 99%)
“…Regarding the issue of I/O and storage for deep learning, both the HPC and deep learning communities have, so far, dedicated most efforts to accessing large training datasets efficiently [10], [11], [12], [13], while leaving the problem of optimized checkpointing of learning models largely ignored. TensorFlow checkpoints models to files in its SavedModel format, or in HDF5 files through Keras.…”
Section: Related Work (mentioning; confidence: 99%)
“…DNN model checkpointing: The problem of checkpointing DNN models efficiently is beginning to emerge in deep learning, where most efforts so far focus on efficient access of training batches [28], [29], [30], [31]. TensorFlow checkpoints models to files in its SavedModel format, or in HDF5 files through Keras.…”
Section: Background and Problem Formulation (mentioning; confidence: 99%)