2020
DOI: 10.1002/cpe.5823

Mary, Hugo, and Hugo*: Learning to schedule distributed data‐parallel processing jobs on shared clusters

Abstract: Distributed data‐parallel processing systems like MapReduce, Spark, and Flink are popular for analyzing large datasets using cluster resources. Resource management systems like YARN or Mesos in turn allow multiple data‐parallel processing jobs to share cluster resources in temporary containers. Often, the containers do not isolate resource usage to achieve high degrees of overall resource utilization despite overprovisioning and the often fluctuating utilization of specific jobs. However, some combinat…

citations
Cited by 5 publications
(3 citation statements)
references
References 25 publications
“…Additional challenges are introduced when jobs do not use resources in isolation, but share access and potentially interfere with each other, impeding individual job performance often significantly [29,27].…”
Section: Runtime Data Sharing (mentioning)
confidence: 99%
“…Since SJFN assigns all tasks to the most powerful machines, many tasks have to share the resources. This sharing can lead to interferences, which can get higher with a higher number of competing tasks [41]-[43].…”
Section: E Experiments (mentioning)
confidence: 99%
“…Many other works apply reinforcement learning to integrate the exploration of potential solution spaces directly with an optimization towards given objectives such as high resource utilization, low interference, and cluster throughput. In this way, several novel cluster schedulers use either classical or deep reinforcement learning methods to schedule various types of cluster jobs in large data center infrastructures [127]-[129]. Other systems use reinforcement learning, for example, to re-provision and scale microservices towards given service-level objectives [130].…”
Section: G Machine Learning Plays An Increasing Role For Cloud Systems (mentioning)
confidence: 99%