Proceedings of the Twelfth European Conference on Computer Systems 2017
DOI: 10.1145/3064176.3064182

Proteus

Cited by 68 publications (5 citation statements)
References 19 publications
“…Several strategies have been put out to deal with these limitations and improve cloud computing's capabilities. For instance, Proteus, an elastic PS framework created to scale up training on public clouds, was introduced by Harlap et al [89]. The framework dynamically assigns PSs and personnel using three transitional stages, maximizing cost reductions, particularly when temporary revocable resources become available.…”
Section: Parameter Server
mentioning
confidence: 99%
“…With the need for training DNNs beyond convolutional neural networks [72], [73], DNN training remains a challenging cloud engineering problem - even with the emergence of serverless training services [74], [75]. Additionally, as the training scenario shifts from dedicated clusters (one training job per cluster) to shared clusters, problems such as resource provisioning (with cheap transient resources) [76], [77] and GPU scheduling [78]-[80] still remain unresolved to effectively trade-off cluster utilization and training accuracy and throughput. Edge resources and micro-data centers close to data sources can be utilized for distributed DL training [81], which requires solving challenges of data distribution and resource heterogeneity.…”
Section: IoT and AI Are Becoming the Main Applications
mentioning
confidence: 99%
“…Narrow Applicability: Some existing approaches often suffer from limited applicability, restricting their effectiveness in specific scenarios. Proteus [11] only considers distributed learning environments with a Parameter Server (PS) architecture, and as the number of Spot VMs increases, the proportion of On-Demand VM usage also increases, reducing cost-efficiency. Spotnik [12] is limited to situations where only part of the cluster faces preemption and is incapable of avoiding complete checkpoint approaches when the entire cluster is revoked.…”
Section: Existing Approaches and Their Limitations
mentioning
confidence: 99%
“…Limitations of Existing Approaches: There is a very small effort to construct cost-effective clusters that leverage both Spot and On-Demand VMs. Some studies [11,12,13,14] are limited by their narrow applicability, as they are effective only in certain usecases and/or are tailored for specific architectures. Other approaches [15,16,17,18] fail adequately propose methods for cluster configuration, lack comprehensive analysis or experimentation in complex scenarios, and fail to simultaneously consider both price and performance.…”
mentioning
confidence: 99%