2020
DOI: 10.48550/arxiv.2012.00825
Preprint

A Study of Checkpointing in Large Scale Training of Deep Neural Networks

Abstract: Deep learning (DL) applications are increasingly being deployed on HPC systems to leverage the massive parallelism and computing power of those systems for DL model training. While significant effort has been put into facilitating distributed training in DL frameworks, fault tolerance has been largely ignored. In this work, we evaluate checkpoint-restart, a common fault tolerance technique in HPC workloads. We perform experiments with three state-of-the-art DL frameworks common in HPC (Chainer, PyTorch, and TensorFlow)…

Cited by 3 publications (3 citation statements) | References 9 publications
“…The ModelCheckpoint option, provided by Keras [14], automatically saves the weights of the best model in terms of a specified metric [25]. We choose to monitor the accuracy evaluated on the validation data so as to keep the weights of the model with the highest accuracy.…”
Section: Model Checkpoint
confidence: 99%
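As a concrete illustration of the callback this statement describes, the minimal sketch below configures Keras's ModelCheckpoint to keep only the best weights by validation accuracy. The model and data names (build_model, x_train, y_train, x_val, y_val) are hypothetical placeholders, not taken from the cited paper.

```python
# Minimal sketch of best-weights checkpointing with Keras's ModelCheckpoint.
# build_model and the x_*/y_* arrays are hypothetical placeholders.
import tensorflow as tf

model = build_model()  # assumed helper returning a compiled tf.keras.Model

# Keep only the weights of the best model seen so far, judged by
# validation accuracy (higher is better, hence mode="max").
checkpoint_cb = tf.keras.callbacks.ModelCheckpoint(
    filepath="best_weights.h5",
    monitor="val_accuracy",
    mode="max",
    save_best_only=True,
    save_weights_only=True,
)

model.fit(
    x_train, y_train,
    validation_data=(x_val, y_val),
    epochs=20,
    callbacks=[checkpoint_cb],
)
```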
“…A DL job may need multiple GPUs and several parameter servers; in the absence of any of these, it cannot be scheduled to run. (3) Checkpoint mechanism [3], [15]. The model parameters can be saved as a checkpoint file when an epoch is completed.…”
Section: A. Features of Deep Learning Jobs
confidence: 99%
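The per-epoch checkpoint mechanism this statement refers to can be sketched as follows in PyTorch (one of the frameworks the paper studies); model, optimizer, num_epochs, and train_one_epoch are assumed to be defined elsewhere.

```python
# Sketch of saving a checkpoint file each time an epoch completes.
# model, optimizer, num_epochs, and train_one_epoch are assumed names.
import torch

for epoch in range(num_epochs):
    train_one_epoch(model, optimizer)  # hypothetical training step

    # Persist the model parameters (and optimizer state) at the epoch
    # boundary, so training can restart from the last completed epoch.
    torch.save(
        {
            "epoch": epoch,
            "model_state": model.state_dict(),
            "optimizer_state": optimizer.state_dict(),
        },
        f"checkpoint_epoch_{epoch}.pt",
    )
```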
“…The model is fitted to the training dataset; this function trains the model for a fixed number of epochs. In this phase, the dataset information is read, and checkpoints [15] are generated according to the configuration used. After this training, an evaluation is performed to confirm that the model is working as desired.…”
Section: File Access Pattern
confidence: 99%
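The fit-then-evaluate flow in this last statement can be sketched by extending the earlier Keras example: train with the checkpoint callback, then restore the best saved weights and evaluate them. The x_test/y_test arrays are hypothetical placeholders, and the model is assumed to be compiled with an accuracy metric.

```python
# Sketch of the fit -> checkpoint -> evaluate flow from the cited passage,
# reusing model and checkpoint_cb from the earlier ModelCheckpoint sketch.
model.fit(
    x_train, y_train,
    validation_data=(x_val, y_val),
    epochs=20,
    callbacks=[checkpoint_cb],  # checkpoints written per its configuration
)

# Restore the best saved weights and confirm the model works as desired.
model.load_weights("best_weights.h5")
loss, accuracy = model.evaluate(x_test, y_test)  # assumes one accuracy metric
print(f"test accuracy: {accuracy:.4f}")
```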