2022
DOI: 10.1109/tc.2021.3104747

Resilient Scheduling of Moldable Parallel Jobs to Cope With Silent Errors

Abstract: We study the resilient scheduling of moldable parallel jobs on high-performance computing (HPC) platforms. Moldable jobs allow for choosing a processor allocation before execution, and their execution time obeys various speedup models. The objective is to minimize the overall completion time of the jobs, or the makespan, when jobs can fail due to silent errors and hence may need to be re-executed after each failure until successful completion. Our work generalizes the classical scheduling framework for failure-…

Cited by 5 publications (13 citation statements)
References 48 publications

“…We have recently investigated the problem of scheduling independent moldable tasks subject to failures [3], where tasks need to be re-executed after a failure until a successful completion. This corresponds to a semi-online setting, since all tasks are known at the beginning, but failed tasks are only discovered on-the-fly.…”
Section: Related Work (mentioning)
confidence: 99%
“…It consists of two steps. The first step performs an initial allocation for the task, which is inspired by the Local Processor Allocation (LPA) strategy proposed in [2], [3]. Specifically, for each possible allocation $p \in [1, p_j^{\max}]$, we define the ratio between the area of the task and the minimum area to be $\alpha_p = a_j(p)/a_j^{\min}$, and the ratio between the execution time of the task and the minimum execution time to be $\beta_p = t_j(p)/t_j^{\min}$.…”
Section: Algorithm Description (mentioning)
confidence: 99%
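
The two ratios defined in the excerpt above can be illustrated with a minimal sketch. It is not the citing paper's implementation: the excerpt does not say how the area is computed or how an allocation is chosen from the ratios, so the code assumes the area is the processor-time product a_j(p) = p * t_j(p), which is the usual convention for moldable tasks. The names `allocation_ratios` and `amdahl_time`, and the Amdahl parameters, are hypothetical and chosen only for illustration.

```python
from typing import Callable, Dict, Tuple


def allocation_ratios(
    exec_time: Callable[[int], float], p_max: int
) -> Dict[int, Tuple[float, float]]:
    """Return {p: (alpha_p, beta_p)} for every allocation p in [1, p_max].

    alpha_p = a(p) / a_min  (area ratio)
    beta_p  = t(p) / t_min  (execution-time ratio)
    Assumption (not stated in the excerpt): area a(p) = p * t(p).
    """
    if p_max < 1:
        raise ValueError("p_max must be at least 1")
    times = {p: exec_time(p) for p in range(1, p_max + 1)}
    areas = {p: p * times[p] for p in times}  # assumed definition of area
    t_min = min(times.values())
    a_min = min(areas.values())
    return {p: (areas[p] / a_min, times[p] / t_min) for p in times}


def amdahl_time(p: int, work: float = 100.0, seq_fraction: float = 0.1) -> float:
    """Hypothetical Amdahl-style speedup model, used only as example input."""
    return work * (seq_fraction + (1.0 - seq_fraction) / p)


if __name__ == "__main__":
    for p, (alpha, beta) in allocation_ratios(amdahl_time, 4).items():
        print(f"p={p}: alpha={alpha:.3f}, beta={beta:.3f}")
```

Under this example model, the area grows with p while the execution time shrinks, so alpha_p is smallest at p = 1 and beta_p is smallest at p = 4. Both ratios are at least 1 by construction, which makes them a natural way to express the trade-off between area and execution time that an allocation step has to balance.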