2014 IEEE International Conference on Cluster Computing (CLUSTER) 2014
DOI: 10.1109/cluster.2014.6968751

Balancing job performance with system performance via locality-aware scheduling on torus-connected systems

Abstract: Torus-connected networks are widely used in modern supercomputers due to their linear per-node cost scaling and competitive overall performance. The job scheduling system plays a critical role in the efficient use of supercomputers. As supercomputers continue to grow in size, a fundamental problem arises: how to effectively balance job performance with system performance on torus-connected machines? In this work, we present a new scheduling design named window-based locality-aware scheduling. Our design cont…

Citations: cited by 15 publications (7 citation statements)
References: 15 publications
“…Similar strategies have been recently incorporated within SLURM with⁶ (or without⁷) the use of ALPS. Another interesting work (Yang et al, 2014) adapted only for torus topology and presented a window-based locality-aware job-scheduling strategy that tries to optimize job and system performance in the same time. Its goal is to preserve node contiguity by considering multiple jobs for scheduling while making use of the 0-1 multiple knapsack problem for resource allocation.…”
Section: Related Work and Discussion
Citation type: mentioning
confidence: 99%
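The citation statement above describes resource allocation as a 0-1 multiple knapsack problem over a window of jobs. The following is a minimal sketch of that idea, not the formulation used by Yang et al.: it assumes each free contiguous block of nodes is a knapsack, each job's value equals the number of nodes it requests, and the window is small enough for exhaustive search. The function name `pack_window` and both parameter names are hypothetical.

```python
from itertools import product


def pack_window(job_sizes, block_sizes):
    """Exhaustively solve a small 0-1 multiple knapsack instance.

    job_sizes:   nodes requested by each job in the scheduling window.
    block_sizes: sizes of the free contiguous node blocks (the knapsacks).

    Returns (allocated_nodes, assignment), where assignment[i] is the index
    of the block that job i is placed in, or None if job i stays queued.
    Assumption: a job's value is simply its node count.
    """
    choices = [None] + list(range(len(block_sizes)))
    best_total = 0
    best_assignment = tuple([None] * len(job_sizes))

    # Try every way of placing each job in one block or leaving it queued.
    for assignment in product(choices, repeat=len(job_sizes)):
        used = [0] * len(block_sizes)
        for size, block in zip(job_sizes, assignment):
            if block is not None:
                used[block] += size

        # Discard assignments that overfill any contiguous block.
        if any(u > cap for u, cap in zip(used, block_sizes)):
            continue

        total = sum(used)
        if total > best_total:
            best_total, best_assignment = total, assignment

    return best_total, best_assignment


if __name__ == "__main__":
    total, assignment = pack_window([4, 8, 2, 6], [8, 6])
    print(total, assignment)
```

Because every job is placed entirely inside one block, each scheduled job keeps its nodes contiguous. The search visits (blocks + 1)^jobs assignments, so it is only practical for small windows; a production scheduler would need a dynamic-programming or heuristic solver.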
“…Newly inserted or modified key‐value data will be replicated asynchronously to secondary replicas that have closer hashed location. By communicating only with near neighbors, this approach ensures that replicas only consume less network resources when we succeed in implementing the topology‐aware and locality‐aware protocols (similar approach can be found in ). Despite the lack of topology‐aware in the current ZHT, the asynchronous replication only adds relatively small overhead when adding more replicas at modest scales (up to 4K cores).…”
Section: ZHT Design and Implementation
Citation type: mentioning
confidence: 99%
“…The underlying CQSim scheduling simulator has been successfully supporting a number of projects in this field over a decade [8, 19-30]. CQSim provides a unified platform to evaluate the performance of various methods with minimal overheads.…”
Section: Impact
Citation type: mentioning
confidence: 99%