Proceedings of the Platform for Advanced Scientific Computing Conference 2020
DOI: 10.1145/3394277.3401850

Deploying Scientific AI Networks at Petaflop Scale on Secure Large Scale HPC Production Systems with Containers

Abstract: There is an ever-increasing need for computational power to train complex artificial intelligence (AI) & machine learning (ML) models to tackle large scientific problems. High performance computing (HPC) resources are required to efficiently compute and scale complex models across tens of thousands of compute nodes. In this paper, we discuss the issues associated with the deployment of machine learning frameworks on large scale secure HPC systems and how we successfully deployed a standard machine learning fra…

Cited by 8 publications (5 citation statements); references 15 publications.
“…Over the next decade, scientists will see a 10-100 times increase in sensitivity and resolution from their instruments, necessitating a comparable scale-up in data storage and processing capacity. The data derived by these upgraded instruments will push Moore's law to its limits, posing a threat to conventional operating models predicated primarily on HPC in data centers [167,168]. Conventional HPC architectures were developed for simulation-based methods like computational fluid dynamics.…”
Section: High-Performance Computing (mentioning, confidence 99%)
“…In [16], the authors focus on applications oriented to deep learning, for which they use Charliecloud [12]. Subsequently, a similar workflow is applied to the training of a deep learning model in the field of particle physics [17]. The developed workflow allows them to take advantage of the hundreds of CPUs available in their cluster.…”
Section: Related Work (mentioning, confidence 99%)
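The workflow quoted above pairs an unprivileged container runtime (Charliecloud) with an MPI-style launcher so the same image can be scaled across hundreds of CPUs. As a rough, hypothetical illustration only (it is not taken from the cited papers), a minimal Horovod/TensorFlow training script of the kind such a container might execute could look like the sketch below; the model, dataset, and hyperparameters are toy placeholders.

# Hedged sketch, not code from the cited papers: a minimal data-parallel
# Keras training script of the kind such a containerized workflow might
# launch with mpirun/srun across many CPU nodes.
import tensorflow as tf
import horovod.tensorflow.keras as hvd

hvd.init()  # one Horovod worker per MPI rank started by the launcher

# Toy stand-in for the real training data shipped alongside the image.
(x_train, y_train), _ = tf.keras.datasets.mnist.load_data()
x_train = x_train[..., None].astype("float32") / 255.0

model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(32, 3, activation="relu", input_shape=(28, 28, 1)),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(10, activation="softmax"),
])

# Scale the learning rate with the worker count and wrap the optimizer so
# gradients are averaged across ranks via allreduce. Newer TensorFlow
# releases may require tf.keras.optimizers.legacy.SGD here.
opt = hvd.DistributedOptimizer(tf.keras.optimizers.SGD(0.01 * hvd.size()))
model.compile(optimizer=opt,
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# Keep all ranks consistent by broadcasting rank 0's initial weights.
callbacks = [hvd.callbacks.BroadcastGlobalVariablesCallback(0)]
model.fit(x_train, y_train, batch_size=64, epochs=1,
          callbacks=callbacks,
          verbose=1 if hvd.rank() == 0 else 0)

The launcher (mpirun or srun) determines hvd.size(), so the same script runs unchanged whether it is started on a single node or on hundreds.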
“…This experiment was executed in a different cluster from the previous one; however, it was not necessary to apply substantial changes to the workflow. The image built for this experiment was developed using the image from the TensorFlow Benchmark experiment as its basis, solely adding the software required for this second experiment. The scheduling of the jobs has been done in the same way in both experiments (following the indications of Sect.…”
Section: Empirical Statistical Downscaling (mentioning, confidence 99%)
“…Their results indicate that running computationally intensive jobs on CPUs/GPUs has little overhead compared to running the jobs directly on top of the operating system. David Brayford et al. have successfully used containers to deploy a three-dimensional convolutional GAN (3DGAN) with petaflop performance on High-Performance Computing (HPC) resources [24]. They state that using containers allowed them to use HPC clusters without sacrificing the security of the cluster, and helped the execution…”
(mentioning, confidence 99%)
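The quoted result rests on a deployment pattern in which the batch scheduler starts an unprivileged container instance for every MPI rank, so the cluster's security model is left untouched. The following is a hypothetical Python launcher illustrating that shape, not the authors' tooling; real deployments would normally issue the equivalent srun/ch-run line directly from a batch script, and the image path, rank count, and runtime options must match the target cluster's Charliecloud and Slurm installation.

# Hedged sketch, not the authors' tooling: start one unprivileged container
# per MPI rank from inside a batch allocation. All paths and counts below
# are hypothetical placeholders.
import subprocess

IMAGE_DIR = "/work/images/tf-training"   # unpacked container image (placeholder)
SCRIPT = "/workspace/train_hvd.py"       # training script inside the image (placeholder)
RANKS = 192                              # illustrative number of MPI ranks

cmd = [
    "srun", "-n", str(RANKS),            # the scheduler starts the MPI ranks
    "ch-run", IMAGE_DIR, "--",           # ch-run executes the command inside the image
    "python3", SCRIPT,
]
subprocess.run(cmd, check=True)

Because ch-run needs neither a daemon nor root privileges on the compute nodes, the scheduler and the site's security policies stay in control of the job, which is the property the quoted statement highlights.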