2020
DOI: 10.48550/arxiv.2012.00825
Preprint

A Study of Checkpointing in Large Scale Training of Deep Neural Networks

Abstract: Deep learning (DL) applications are increasingly being deployed on HPC systems to leverage the massive parallelism and computing power of those systems for DL model training. While significant effort has been put into facilitating distributed training in DL frameworks, fault tolerance has been largely ignored. In this work, we evaluate checkpoint-restart, a common fault tolerance technique in HPC workloads. We perform experiments with three state-of-the-art DL frameworks common in HPC (Chainer, PyTorch, and TensorFlow)…

Cited by 3 publications (3 citation statements) | References 9 publications
“…The ModelCheckpoint option, provided by Keras [14], automatically saves the weights of the best model in terms of a specified metric [25]. We choose to monitor the accuracy evaluated on the validation data so as to keep the weights of the model with the highest accuracy.…”
Section: Model Checkpoint
confidence: 99%
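As a concrete illustration of the callback this statement describes, the minimal sketch below configures Keras's ModelCheckpoint to keep only the best weights by validation accuracy. The model and data names (build_model, x_train, y_train, x_val, y_val) are hypothetical placeholders, not taken from the cited paper.

```python
# Minimal sketch of best-weights checkpointing with Keras's ModelCheckpoint.
# build_model and the x_*/y_* arrays are hypothetical placeholders.
import tensorflow as tf

model = build_model()  # assumed helper returning a compiled tf.keras.Model

# Keep only the weights of the best model seen so far, judged by
# validation accuracy (higher is better, hence mode="max").
checkpoint_cb = tf.keras.callbacks.ModelCheckpoint(
    filepath="best_weights.h5",
    monitor="val_accuracy",
    mode="max",
    save_best_only=True,
    save_weights_only=True,
)

model.fit(
    x_train, y_train,
    validation_data=(x_val, y_val),
    epochs=20,
    callbacks=[checkpoint_cb],
)
```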
“…A DL job may need multiple GPUs and several parameter servers; in the absence of any of these, it cannot be scheduled to run. (3) Checkpoint mechanism [3], [15]. The model parameters can be saved as a checkpoint file when an epoch is completed.…”
Section: A. Features of Deep Learning Jobs
confidence: 99%
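The per-epoch checkpoint mechanism this statement refers to can be sketched as follows in PyTorch (one of the frameworks the paper studies); model, optimizer, num_epochs, and train_one_epoch are assumed to be defined elsewhere.

```python
# Sketch of saving a checkpoint file each time an epoch completes.
# model, optimizer, num_epochs, and train_one_epoch are assumed names.
import torch

for epoch in range(num_epochs):
    train_one_epoch(model, optimizer)  # hypothetical training step

    # Persist the model parameters (and optimizer state) at the epoch
    # boundary, so training can restart from the last completed epoch.
    torch.save(
        {
            "epoch": epoch,
            "model_state": model.state_dict(),
            "optimizer_state": optimizer.state_dict(),
        },
        f"checkpoint_epoch_{epoch}.pt",
    )
```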
“…The model is fitted to the training dataset; this function trains the model for a fixed number of epochs. In this phase, the dataset information is read, and checkpoints [15] are generated according to the configuration used. After this training, an evaluation is performed to confirm that the model is working as desired.…”
Section: File Access Pattern
confidence: 99%
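The fit-then-evaluate flow in this last statement can be sketched by extending the earlier Keras example: train with the checkpoint callback, then restore the best saved weights and evaluate them. The x_test/y_test arrays are hypothetical placeholders, and the model is assumed to be compiled with an accuracy metric.

```python
# Sketch of the fit -> checkpoint -> evaluate flow from the cited passage,
# reusing model and checkpoint_cb from the earlier ModelCheckpoint sketch.
model.fit(
    x_train, y_train,
    validation_data=(x_val, y_val),
    epochs=20,
    callbacks=[checkpoint_cb],  # checkpoints written per its configuration
)

# Restore the best saved weights and confirm the model works as desired.
model.load_weights("best_weights.h5")
loss, accuracy = model.evaluate(x_test, y_test)  # assumes one accuracy metric
print(f"test accuracy: {accuracy:.4f}")
```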