2022
DOI: 10.1007/s10586-022-03798-7

A container-based workflow for distributed training of deep learning algorithms in HPC clusters

Abstract: Deep learning has been postulated as a solution for numerous problems in different branches of science. Given the resource-intensive nature of these models, they often need to be executed on specialized hardware such as graphical processing units (GPUs) in a distributed manner. In the academic field, researchers get access to this kind of resource through High Performance Computing (HPC) clusters. This kind of infrastructure makes the training of these models difficult due to their multi-user nature and limited …
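
As a rough illustration of the kind of workflow the abstract describes (containerized, multi-node GPU training on a shared HPC cluster), the following is a minimal sketch of a Slurm batch script that launches training from a container with Apptainer. The image name, training script, and resource values are hypothetical and not taken from the paper:

  #!/bin/bash
  #SBATCH --job-name=dl-train        # hypothetical job name
  #SBATCH --nodes=2                  # two GPU nodes, one task each
  #SBATCH --ntasks-per-node=1
  #SBATCH --gpus-per-node=1

  # Build a local image from a registry image; on a multi-user cluster
  # this runs unprivileged, which is why runtimes of this kind are
  # favoured over plain Docker on HPC systems.
  apptainer pull train.sif docker://example/dl-train:latest

  # --nv exposes the node's NVIDIA driver and GPUs inside the container;
  # srun starts one containerized training process per allocated node.
  srun apptainer exec --nv train.sif python train.py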

Cited by 8 publications (2 citation statements)
References 51 publications
“…In these nodes, using one GPU, CNN-DeepESD training takes about 2 seconds per epoch, CNN-PAN 3 seconds and CNN-UNET 50 seconds. To ease the reproducibility of these experiments, as well as to simplify their execution on HPC clusters, we provide the required scripts to follow the workflow presented in González-Abad, López García, and Kozlov (2022). Dockerfiles are also available in the GitHub repository.…”
Section: Data and Code Availability Statement (mentioning, confidence: 99%)
“…We train the models in nodes equipped with graphical processing units (GPUs), more specifically NVIDIA Tesla V100 GPUs. In these nodes, using one GPU, CNN‐DeepESD training takes about 2 s per epoch, CNN‐PAN 3 s and CNN‐UNET 50 s. To ease the reproducibility of these experiments, as well as to simplify their execution on HPC clusters, we provide the required scripts to follow the workflow presented in González‐Abad, López García, and Kozlov (2022). Dockerfiles are also included with the code.…”
Section: Data Availability Statement (mentioning, confidence: 99%)
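
Both statements point to Dockerfiles shipped alongside the citing papers' code. As a hedged sketch of what such an image definition for GPU-based CNN training commonly looks like (the base image, dependency file, and entry point here are assumptions, not contents of the cited repository):

  # Hypothetical Dockerfile for GPU training; the actual files live
  # in the citing papers' GitHub repository.
  FROM tensorflow/tensorflow:2.11.0-gpu
  WORKDIR /app
  # Install pinned Python dependencies first so the layer is cached
  # across image rebuilds.
  COPY requirements.txt .
  RUN pip install --no-cache-dir -r requirements.txt
  COPY . .
  ENTRYPOINT ["python", "train.py"]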