Instability in Geo-Distributed Kubernetes Federation: Causes and Mitigation

Tamiru, Mulugeta Ayalew; Pierre, Guillaume; Tordsson, Johan; Elmroth, Erik

doi:10.1109/mascots50786.2020.9285934

Cited by 5 publications

(4 citation statements)

References 26 publications

“…), and $\Psi_{j,v} = \min_{r \in \Gamma} \left(\Psi^{r}_{j,v}\right)$, (10) extracting the minimum affinity value among the existing resources, hence considering the most conservative scenario.…”

Section: F Dealing With Multiple Computing Resources (mentioning)

confidence: 99%

“…Indeed, these solutions have been designed for centralized data centers, with guarantees of computing and network resources, and are not designed to identify suitable microservice placement considering their communication patterns. Therefore, they fail to scale on geographically distributed edge-like infrastructures seamlessly, specifically when dealing with nodes that are geographically spread over high-latency WANs [10]-[12].…”

Section: Introduction (mentioning)

confidence: 99%

Scheduling Multi-Component Applications Across Federated Edge Clusters With Phare

Castellano, Galantino, Risso et al. 2024

IEEE Open J. Commun. Soc.

The shift towards agile microservice architecture has enabled significant benefits for IT companies but has also resulted in increased complexity for Cloud orchestration tools. Traditional tools were designed for centralized data centers and are ineffective for locating microservices in geographically-distributed edge-like infrastructures. This paper presents Phare, a decentralized scheduling algorithm designed to optimize the placement of microservices by satisfying their computing and communication demands while minimizing deployment costs. Phare employs a heuristic-based approach to solve the NP-Hard scheduling problem, prioritizing the microservices with the more stringent requirements and placing them on the most convenient computing facilities, based on the concept of affinity, contributing to the field by providing a more holistic approach to resource scheduling in edge computing. We validate our approach against Firmament, the state-of-the-art workload scheduling algorithm for component-based applications, on simulated edge infrastructures with hundreds of clusters. Phare achieves up to a 10× reduction in terms of deployment costs compared to Firmament while providing a much lower scheduling latency.

“…V. In Sec. VI, we discuss the experimental setup and Several DCI implementations involving IoT deployments emphasised the DCI benefits to Edge computing use cases [6], [7], [8], [9], [10], [11], [12], [13]. The implemented DCIs in these studies, as shown in Fig.…”

Section: Paper Organization (mentioning)

confidence: 99%

“…• Third, despite the efforts toward federated Kubernetes clusters, recent implementations show that Kubernetes federation controllers cannot scale to a sufficient size for Edge computing use cases [9] and can lead to Edge workload deployment instability [10]. Furthermore, Kubernetes-based distributed workload deployments (e.g., multi-cluster and multi-cloud) necessitate the use of additional cluster management tools (e.g., Open Cluster Management) to automate cluster registration, work distribution, and dynamic policy and workload placement.…”

Section: Paper Organization (mentioning)

confidence: 99%

Edge Workloads Monitoring and Failover: a StarlingX-Based Testbed Implementation and Measurement Study

et al. 2022

With the ever-growing amount of time-critical, compute-intensive, and private IoT applications, the need for High Availability (HA) Edge Clouds becomes indispensable. Realizing HA Edge Clouds is inherently challenging due to the geographically-dispersed hierarchy of the Distributed Cloud Infrastructure (DCI). For example, frequent isolation between the central Cloud and Edge Clouds due to networking instability necessitates some autonomous operations at the Edge Clouds. Furthermore, because Edge Clouds have fewer resources than central Clouds, configuring the Edge functions (i.e., control, compute, and storage) in HA clusters will undoubtedly reduce downtime. However, it will limit the Edge scalability. To that end, StarlingX is developing an HA-protected and scalable DCI virtualization platform based on the open-source ecosystem, focusing on low-touch management of Edge Clouds. StarlingX provides a fault management service that realizes DCI-wide alarming and logging capabilities, allowing for rapid response to virtualized infrastructure events. Recently, the IETF Network Working Group proposed that monitoring both the DCI and the Edge workloads (software containers) is critical for an Edge Computing Platform to maintain HA IoT application deployment. Indeed, the possibility of the infrastructure remaining stable and healthy while the workloads suffer a fatal failure simultaneously necessitates failover functionality that monitors both the infrastructure and the Edge workloads. In this paper, we first propose a dynamic failover functionality that centrally monitors Edge workloads to recover from deployment or Edge node failures, motivated by the IETF direction. Second, we experimentally optimize the failover functionality for monitoring a microservice-architected IoT application deployed on a StarlingX-based DCI testbed to collect temperature sensor readings from Raspberry Pis. 
Regardless of how quickly the Edge workload health checks are collected, the recorded failover measurements reveal that the recovery time will not drop below a predetermined level controlled by Edge resources and network speed. Furthermore, reducing the statistics collection timeout reduces the recovery time of an Edge node failure. When the timeout value is less than the minimum achievable recovery time, false-positive failures (FPFs) can occur. Third, to supplement the StarlingX fault management service, we provide a modular implementation of the proposed failover functionality. Finally, we present the first-ever introduction of the StarlingX platform's software stack to promote its use in academic research.

INDEX TERMS Distributed cloud infrastructure, edge computing, failover, IoT, Kubernetes, microservice architecture, StarlingX platform, testbed.

The associate editor coordinating the review of this manuscript and approving it for publication was Sathish Kumar.

Service Mesh Controller for Cooperative Load Balancing among Neighboring Edge Servers

Furusawa, Abe, Okada et al. 2022

2022 IEEE International Symposium on Local and Metropolitan Area Networks (LANMAN)
