2022
DOI: 10.32604/cmc.2022.023318

Multi-Agent Deep Reinforcement Learning-Based Resource Allocation in HPC/AI Converged Cluster

Abstract: As the complexity of deep learning (DL) networks and training data grows enormously, methods that scale with computation are becoming the future of artificial intelligence (AI) development. In this regard, the interplay between machine learning (ML) and high-performance computing (HPC) is an innovative paradigm to speed up the efficiency of AI research and development. However, building and operating an HPC/AI converged system require broad knowledge to leverage the latest computing, networking, and storage te…

Citations: cited by 5 publications (3 citation statements)
References: 22 publications
“…• Resource Allocation: AI optimizes resource utilization in HPC clusters by dynamically (Chien, Lai, and Chao 2019) allocating computing resources based on workload demands or previous runs (Narantuya et al 2022). This leads to cost savings and improved efficiency in resource usage.…”
Section: HPC and AI Synergy
Citation type: mentioning
confidence: 99%
“…One unique study implemented by Narantuya et al utilized a multi-agent DRL (mDRL) based on a DQN to optimize computational resource allocation in high-performance computing (HPC)/AI systems. Their system was further deployed in real-time, reducing the task completion time by 20% and the energy consumption by 40% [119]. Finally, Beimann et al conducted a comparative analysis of four different DRL methods for the control of a simulated HVAC system of a data centre.…”
Section: Datacenters
Citation type: mentioning
confidence: 99%
“…Reinforcement learning (RL) has been recently adopted to solve cloud and edge computing resource allocation problems [4][5][6][7][8][9], and specifically container placement [10,11]. Busoniu et al [12] presented a comprehensive survey of multi-agents where the agents are capable of discovering a solution on their own using reinforcement learning.…”
Section: Introduction
Citation type: mentioning
confidence: 99%