2019 IEEE International Parallel and Distributed Processing Symposium (IPDPS) 2019
DOI: 10.1109/ipdps.2019.00099

VeloC: Towards High Performance Adaptive Asynchronous Checkpointing at Large Scale

Abstract: Global checkpointing to external storage (e.g., a parallel file system) is a common I/O pattern of many HPC applications. However, given the limited I/O throughput of external storage, global checkpointing can often lead to I/O bottlenecks. To address this issue, a shift from synchronous checkpointing (i.e., blocking until writes have finished) to asynchronous checkpointing (i.e., writing to faster local storage and flushing to external storage in the background) is increasingly being adopted. However, with ri…
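The distinction the abstract draws between synchronous and asynchronous checkpointing can be illustrated with a minimal Python sketch. This is not VeloC's API or implementation: the class `AsyncCheckpointer`, its methods, and the `demo` helper are hypothetical names, and two temporary directories stand in for "local storage" and "external storage". The application blocks only for the local write, while a background thread copies the file to the external location.

```python
import os
import queue
import shutil
import tempfile
import threading


class AsyncCheckpointer:
    """Hypothetical sketch: write locally, flush to external storage in the background."""

    def __init__(self, local_dir, external_dir):
        self.local_dir = local_dir
        self.external_dir = external_dir
        self._pending = queue.Queue()
        self._worker = threading.Thread(target=self._flush_loop, daemon=True)
        self._worker.start()

    def checkpoint(self, name, data):
        # Blocking part: only the write to (assumed faster) local storage.
        local_path = os.path.join(self.local_dir, name)
        with open(local_path, "wb") as f:
            f.write(data)
        # Hand the flush over to the background thread and return immediately.
        self._pending.put(name)
        return local_path

    def _flush_loop(self):
        while True:
            name = self._pending.get()
            try:
                if name is None:  # sentinel: stop the worker
                    return
                shutil.copyfile(
                    os.path.join(self.local_dir, name),
                    os.path.join(self.external_dir, name),
                )
            finally:
                self._pending.task_done()

    def wait(self):
        # Block until every queued checkpoint has been flushed.
        self._pending.join()

    def close(self):
        self._pending.put(None)
        self._worker.join()


def demo():
    # Temporary directories stand in for node-local and external storage.
    with tempfile.TemporaryDirectory() as local, tempfile.TemporaryDirectory() as external:
        ckpt = AsyncCheckpointer(local, external)
        ckpt.checkpoint("step0.bin", b"state-0")
        ckpt.wait()
        with open(os.path.join(external, "step0.bin"), "rb") as f:
            flushed = f.read()
        ckpt.close()
        return flushed
```

A synchronous variant would perform the copy to external storage inside `checkpoint` itself, so the application would wait for the slower write before continuing.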

Cited by 67 publications (52 citation statements)
References 21 publications
“…Encouraged by these results, we plan to explore in future work how to design advanced asynchronous checkpointing techniques to preserve the state of models at high frequency by taking advantage of the observation that checkpointing is an immutable operation. To this end, we plan to leverage VeloC [28], a large-scale checkpointing system that features asynchronous management of deep storage hierarchies.…”
Section: Discussion (mentioning)
confidence: 99%
“…Works representative of this approach include SCR [2] and FTI [3], which introduce support for local storage, partner replication, erasure coding (XOR and Reed-Solomon [4]) and finally external storage (parallel file systems). Recent efforts such as VELOC can take advantage of heterogeneous storage for each level and introduce advanced asynchronous techniques that leverage synergies between the levels [5] and predictions of application behavior to mitigate interference [6].…”
Section: Related Work (mentioning)
confidence: 99%
“…To address this problem, we propose a transparent solution that automatically detects, mixes and matches heterogeneous storage using vendor-specific APIs when available for optimal performance. This is done in close coordination with asynchronous multi-level checkpointing, introducing awareness of fine-grain I/O operations and optimal flushing strategies based on producer-consumer strategies that rely on performance modeling [5]. c) Efficient serialization on local storage: Even when advanced asynchronous techniques are employed for multilevel checkpointing, serialization to local storage can still incur significant overhead.…”
Section: A. Design Principles (mentioning)
confidence: 99%
“…In this regard, our approach can take advantage of VeloC [156], an exascale-ready checkpointing system that leverages heterogeneous storage hierarchies to implement multilevel resilience strategies. Two key features of VeloC are particularly interesting in this context: (1) it exposes a memory-based API that is well suited to protect the critical data structures stored in main memory by DIY; and (2) it implements an asynchronous mechanism that hides the overhead of the resilience strategies in the background, while DIY continues running.…”
Section: Implementation of the Unified Distributed Data Abstraction (mentioning)
confidence: 99%