VeloC: Towards High Performance Adaptive Asynchronous Checkpointing at Large Scale

Nicolae, Bogdan; Moody, Adam; Gonsiorowski, Elsa; Mohror, Kathryn; Cappello, Franck

doi:10.1109/ipdps.2019.00099

Cited by 67 publications

(52 citation statements)

References 21 publications


“…Encouraged by these results, we plan to explore in future work how to design advanced asynchronous checkpointing techniques to preserve the state of models at high frequency by taking advantage of the observation that checkpointing is an immutable operation. To this end, we plan to leverage VeloC [28], a large-scale checkpointing system that features asynchronous management of deep storage hierarchies.…”

Section: Discussion (mentioning)

confidence: 99%

Understanding Scalability and Fine-Grain Parallelism of Synchronous Data Parallel Training

Nicolae, Wozniak, et al. 2019

2019 IEEE/ACM Workshop on Machine Learning in High Performance Computing Environments (MLHPC)

Self Cite


In the age of big data, deep learning has emerged as a powerful tool to extract insight and exploit its value, both in industry and scientific applications. With increasing complexity of learning models and amounts of training data, data-parallel approaches based on frequent all-reduce synchronization steps are increasingly popular. Despite the fact that high performance computing (HPC) technologies have been designed to address such patterns efficiently, the behavior of data-parallel approaches on HPC platforms is not well understood. To address this issue, in this paper we study the behavior of Horovod, a popular data parallel approach that relies on MPI, on Theta, a pre-Exascale machine at Argonne National Laboratory. Using two representative applications, we explore two aspects: (1) how performance and scalability are affected by important parameters such as number of nodes, number of workers, threads per node, batch size; (2) how computational phases are interleaved with all-reduce communication phases at fine granularity and what consequences this interleaving has in terms of potential bottlenecks. Our findings show that pipelining of back-propagation, gradient reduction and weight updates mitigates the effects of stragglers during all-reduce only partially. Furthermore, there can be significant delays between weight updates, which can be leveraged to mask the overhead of additional background operations that are coupled with the training.



“…Works representative of this approach include SCR [2] and FTI [3], which introduce support for local storage, partner replication, erasure coding (XOR and Reed-Solomon [4]) and finally external storage (parallel file systems). Recent efforts such as VELOC can take advantage of heterogeneous storage for each level and introduce advanced asynchronous techniques that leverage synergies between the levels [5] and predictions of application behavior to mitigate interference [6].…”

Section: Related Work (mentioning)

confidence: 99%
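The XOR erasure coding named in the statement above can be illustrated with a minimal sketch. It is a generic illustration of XOR parity, not the implementation used by SCR, FTI, or VELOC; the function names `xor_parity` and `recover_block` and the sample checkpoint contents are hypothetical, and the blocks are assumed to be padded to equal length.

```python
from functools import reduce


def xor_parity(blocks):
    """Return the byte-wise XOR of equally sized blocks."""
    if len({len(b) for b in blocks}) != 1:
        raise ValueError("all blocks must have the same length")
    # zip(*blocks) yields one tuple of byte values per byte position.
    return bytes(reduce(lambda x, y: x ^ y, column) for column in zip(*blocks))


def recover_block(surviving_blocks, parity):
    """Rebuild the single missing block from the survivors and the parity."""
    return xor_parity(list(surviving_blocks) + [parity])


# Hypothetical local checkpoints held by three ranks (equal length).
checkpoints = [b"rank0-state", b"rank1-state", b"rank2-state"]
parity = xor_parity(checkpoints)

# Simulate losing rank 1's local checkpoint and rebuilding it.
recovered = recover_block([checkpoints[0], checkpoints[2]], parity)
```

XOR parity tolerates the loss of one block per group at the cost of one extra block of storage. Reed-Solomon coding, also mentioned above, tolerates more losses but is more expensive to compute.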

“…To address this problem, we propose a transparent solution that automatically detects, mixes and matches heterogeneous storage using vendor-specific APIs when available for optimal performance. This is done in close coordination with asynchronous multi-level checkpointing, introducing awareness of fine-grain I/O operations and optimal flushing strategies based on producer-consumer strategies that rely on performance modeling [5]. c) Efficient serialization on local storage: Even when advanced asynchronous techniques are employed for multilevel checkpointing, serialization to local storage can still incur significant overhead.…”

Section: A Design Principles (mentioning)

confidence: 99%

“…It consists of three major components: VELOC, a low overhead runtime specifically designed for scalable, high-performance asynchronous multi-level checkpointing for HPC applications [5], a checkpointing module responsible to capture tensors to local storage and a bindings library that interfaces the checkpointing module with VELOC. Both the checkpointing module and the bindings library are new components written from scratch and integrated with VELOC.…”

Section: B Architecture (mentioning)

confidence: 99%


DeepFreeze: Towards Scalable Asynchronous Checkpointing of Deep Learning Models

Nicolae, Wozniak, et al. 2020

2020 20th IEEE/ACM International Symposium on Cluster, Cloud and Internet Computing (CCGRID)

Self Cite


In the age of big data, deep learning has emerged as a powerful tool to extract insight and exploit its value, both in industry and scientific applications. One common pattern emerging in such applications is frequent checkpointing of the state of the learning model during training, needed in a variety of scenarios: analysis of intermediate states to explain features and correlations with training data, exploration strategies involving alternative models that share a common ancestor, knowledge transfer, resilience, etc. However, with increasing size of the learning models and popularity of distributed data-parallel training approaches, simple checkpointing techniques used so far face several limitations: low serialization performance, blocking I/O, stragglers due to the fact that only a single process is involved in checkpointing. This paper proposes a checkpointing technique specifically designed to address the aforementioned limitations, introducing efficient asynchronous techniques to hide the overhead of serialization and I/O, and distribute the load over all participating processes. Experiments with two deep learning applications (CANDLE and ResNet) on a pre-Exascale HPC platform (Theta) show significant improvement over state-of-art, both in terms of checkpointing duration and runtime overhead.

Index Terms: checkpointing; deep learning; fine-grain asynchronous I/O; multi-level data persistence


“…In this regard, our approach can take advantage of VeloC [156], an exascale-ready checkpointing system that leverages heterogeneous storage hierarchies to implement multilevel resilience strategies. Two key features of VeloC are particularly interesting in this context: (1) it exposes a memory-based API that is well suited to protect the critical data structures stored in main memory by DIY; and (2) it implements an asynchronous mechanism that hides the overhead of the resilience strategies in the background, while DIY continues running.…”

Section: Implementation of the Unified Distributed Data Abstraction (mentioning)

confidence: 99%
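The two features described in the statement above (capturing in-memory data, then hiding the slower persistence step in the background) can be sketched as a small producer-consumer pattern. This is only an illustration under stated assumptions: the class `AsyncCheckpointer` and its methods are hypothetical and are not the VeloC API, and a temporary directory stands in for the slower storage tier.

```python
import os
import queue
import tempfile
import threading


class AsyncCheckpointer:
    """Toy checkpointer: fast in-memory snapshot, background flush (hypothetical)."""

    def __init__(self, slow_dir):
        self.slow_dir = slow_dir
        self.pending = queue.Queue()
        self.worker = threading.Thread(target=self._flush_loop, daemon=True)
        self.worker.start()

    def checkpoint(self, name, version, state):
        # Blocking part: copy the state so the application may keep mutating it.
        snapshot = bytes(state)
        self.pending.put((f"{name}-{version}.ckpt", snapshot))

    def _flush_loop(self):
        # Consumer: drain queued snapshots to the slower tier.
        while True:
            item = self.pending.get()
            if item is None:  # shutdown sentinel
                self.pending.task_done()
                break
            filename, snapshot = item
            with open(os.path.join(self.slow_dir, filename), "wb") as f:
                f.write(snapshot)
            self.pending.task_done()

    def wait(self):
        # Block until every queued snapshot has been flushed.
        self.pending.join()

    def close(self):
        self.pending.put(None)
        self.worker.join()


slow_tier = tempfile.mkdtemp()  # stand-in for a parallel file system
ckpt = AsyncCheckpointer(slow_tier)

state = bytearray(b"iteration-0")
ckpt.checkpoint("demo", 0, state)
state[-1:] = b"1"  # the application continues while the flush runs

ckpt.wait()
with open(os.path.join(slow_tier, "demo-0.ckpt"), "rb") as f:
    flushed = f.read()
ckpt.close()
```

Only the in-memory copy blocks the caller. The write to the slow tier overlaps with subsequent computation, which is the general idea behind hiding checkpoint overhead in the background.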

Toward High-Performance Computing and Big Data Analytics Convergence: The Case of Spark-DIY

et al. 2019

Self Cite


Convergence between high-performance computing (HPC) and big data analytics (BDA) is currently an established research area that has spawned new opportunities for unifying the platform layer and data abstractions in these ecosystems. This work presents an architectural model that enables the interoperability of established BDA and HPC execution models, reflecting the key design features that interest both the HPC and BDA communities, and including an abstract data collection and operational model that generates a unified interface for hybrid applications. This architecture can be implemented in different ways depending on the process- and data-centric platforms of choice and the mechanisms put in place to effectively meet the requirements of the architecture. The Spark-DIY platform is introduced in the paper as a prototype implementation of the architecture proposed. It preserves the interfaces and execution environment of the popular BDA platform Apache Spark, making it compatible with any Spark-based application and tool, while providing efficient communication and kernel execution via DIY, a powerful communication pattern library built on top of MPI. Later, Spark-DIY is analyzed in terms of performance by building a representative use case from the hydrogeology domain, EnKF-HGS. This application is a clear example of how current HPC simulations are evolving toward hybrid HPC-BDA applications, integrating HPC simulations within a BDA environment.

Index Terms: Big data analytics, high performance computing, Spark, DIY, MPI.

