Increasing scale and heterogeneity in data centers have led to the development of federated clusters, such as KubeFed, Hydra, and Pigeon, that federate individual data center clusters. In our work, we introduce Megha, a novel decentralized resource management framework for such federated clusters. Megha employs flexible logical partitioning of clusters to distribute its scheduling load, ensuring that the requirements of the workload are satisfied with very low scheduling overheads. It uses a distributed global scheduler that does not rely on a centralized data store but instead works with eventual consistency, unlike other schedulers that use a tiered architecture or rely on centralized databases. Our experiments with Megha show that it can schedule tasks while taking fairness and placement constraints into account, with low resource allocation times on the order of tens of milliseconds.
Cloud providers place tasks from multiple applications on the same resource pool to improve the resource utilization of the infrastructure. The consequent resource contention has an undesirable effect on latency-sensitive tasks. In this article, we present Niyama, a resource isolation approach that uses a modified version of deadline scheduling to protect latency-sensitive tasks from CPU bandwidth contention. Conventionally, deadline scheduling has been used to schedule real-time tasks with well-defined deadlines. Therefore, it cannot be used directly when the deadlines are unspecified. In Niyama, we estimate deadlines in intervals and secure the bandwidth required for each interval, thereby ensuring optimal job response times. We compare our approach with cgroups, Linux's default resource isolation mechanism used in containers today. Our experiments show that Niyama reduces the average delay in tasks by 3× to 20× when compared to cgroups. Since Linux's deadline scheduling policy is work-conserving in nature, there is a small drop in the server-level CPU utilization when Niyama is used naively. We demonstrate how core reservation and oversubscription in the inter-node scheduler can be used to offset this drop; our experiments show a 1.3× to 2.24× decrease in delay in job response time over cgroups while achieving high CPU utilization.
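The interval-based bandwidth reservation described above can be illustrated with a minimal sketch. This is not Niyama's actual estimator: the names `Reservation`, `estimate_reservation`, and `admit`, the peak-plus-headroom heuristic, and the 20% headroom default are all assumptions made for illustration. It only shows the general shape of deadline scheduling, where a task reserves some runtime within each period, and the total reserved bandwidth (runtime divided by period) must fit within the available cores.

```python
from dataclasses import dataclass
from typing import Iterable, Sequence


@dataclass(frozen=True)
class Reservation:
    """A deadline-style reservation: runtime_us of CPU time in every period_us."""
    runtime_us: int
    period_us: int

    @property
    def bandwidth(self) -> float:
        # Fraction of one core that this reservation consumes.
        return self.runtime_us / self.period_us


def estimate_reservation(recent_demand_us: Sequence[int],
                         period_us: int,
                         headroom_pct: int = 20) -> Reservation:
    """Hypothetical estimator (an assumption, not the paper's method):
    reserve the peak CPU demand seen in recent intervals plus some headroom,
    capped at the period length."""
    if not recent_demand_us:
        raise ValueError("need at least one observed interval")
    runtime_us = max(recent_demand_us) * (100 + headroom_pct) // 100
    return Reservation(min(runtime_us, period_us), period_us)


def admit(reservations: Iterable[Reservation], cores: int = 1) -> bool:
    """Simplified admission test: total reserved bandwidth must not exceed
    the number of cores. Real kernels apply additional limits."""
    return sum(r.bandwidth for r in reservations) <= cores


if __name__ == "__main__":
    # Observed CPU demand (microseconds) over three past 10 ms intervals.
    latency_sensitive = estimate_reservation([2000, 3000, 2500], period_us=10_000)
    batch = Reservation(runtime_us=5_000, period_us=10_000)
    print(latency_sensitive, admit([latency_sensitive, batch]))
```

Re-running the estimator at the start of each interval lets the reservation track a task whose demand changes over time, which is the property that makes deadline scheduling usable when no explicit deadline is given.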