With end users demanding more computing resources at lower cost, cloud infrastructure providers need to find ways to protect their revenue. To achieve this, providers aim to increase revenue and lower operational costs. A promising approach to addressing these challenges is to modify the assignment of resources to workloads. This can be used, for example, to consolidate existing workloads; the freed capacity can then serve new requests, or alternatively unused resources may be turned off to reduce power consumption. The goal of this paper is to highlight features, approaches and findings in the literature, in order to identify open challenges and facilitate future developments. We present a definition of cloud systems adaptation, a classification of the key features and a survey of approaches to adapting compute and storage configuration. Based on our analysis, we identify three open research challenges: characterising the workload type, accurate online profiling of workloads, and building highly scalable adaptation mechanisms.
Solutions based on Reinforcement Learning (RL) have been presented to manage cloud infrastructure; however, these tend to be centralised and struggle to maintain Quality of Service (QoS) for data centres with thousands of nodes. To address this, we propose a reinforcement learning management policy that can run in a decentralised manner and converges quickly towards an efficient resource allocation, resulting in fewer SLA violations than centralised architectures. To address some of the common challenges in applying RL to cloud resource management, such as slow learning and state/action management, we use parallel learning and a reduction of the state/action space. We also demonstrate a unique multi-level reinforcement learning cooperation that further reduces SLA violations. We use simulation to evaluate and demonstrate our proposal in practice, and compare the results with an established heuristic, demonstrating a significant reduction in SLA violations and higher scalability.
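As a rough illustration of what reducing the state/action space can look like, the sketch below buckets node CPU utilisation into a handful of levels and applies the standard tabular Q-learning update. This is not the authors' policy: the two actions, the five utilisation levels, and the names `discretise` and `q_update` are assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical action set; the abstract does not list the actions used.
ACTIONS = ("stay", "migrate")


def discretise(cpu_percent, levels=5):
    """Map a CPU utilisation in [0, 100] to one of `levels` coarse states."""
    return min(int(cpu_percent * levels / 100), levels - 1)


def q_update(q, state, action, reward, next_state, alpha=0.5, gamma=0.9):
    """Apply one standard tabular Q-learning update and return the new value."""
    best_next = max(q[(next_state, a)] for a in ACTIONS)
    td_target = reward + gamma * best_next
    q[(state, action)] += alpha * (td_target - q[(state, action)])
    return q[(state, action)]


# Q-table keyed by (state, action); unseen entries default to 0.0.
q_table = defaultdict(float)
s, s_next = discretise(37), discretise(82)
first = q_update(q_table, s, "stay", reward=1.0, next_state=s_next)
second = q_update(q_table, s, "stay", reward=1.0, next_state=s_next)
```

With five levels and two actions the table has at most ten entries per node, which is the kind of reduction that lets many learners run in parallel; the actual state and action encoding in the paper may differ.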
A promising approach to increasing the efficiency of infrastructure usage is to adapt the assignment of resources to workloads. This can be used, for example, to consolidate existing workloads so that the freed capacity can serve new requests, or alternatively unused resources may be turned off to reduce energy consumption. Many architectural solutions have been presented for data centre management; however, these tend to be centralised and may struggle to scale to data centres with tens of thousands of nodes. Distributed approaches solve the scalability problem, but they do not have a global view of resources across the data centre. To address this, we propose a novel hybrid distributed hierarchical framework that is effective at providing the information needed for decision making at scale. We evaluate the performance of our approach by simulation, and demonstrate that a hybrid approach is a viable solution for managing large data centres, through rapid information dissemination and the ability to make decisions using a global view.
Cloud computing is an established paradigm for end users to access resources. Cloud infrastructure providers seek to maximize accepted requests, meet Service Level Agreements (SLAs), and reduce operational costs by dynamically allocating Virtual Machines (VMs) to physical nodes. Many solutions have been presented to manage cloud infrastructure; however, these tend to be centralized and struggle to maintain Quality of Service (QoS) and support data centers with thousands of nodes. Decentralized approaches, with no central management, can manage large data centers. However, these tend to reduce the ability to obtain an optimal resource allocation across the data center. To address this, we propose a hybrid hierarchical decentralized architecture that achieves fewer SLA violations and lowers network traffic. We use simulation to evaluate our proposal in practice with a variety of existing VM placement policies.
Cloud data centres require efficient management of resources and robust methods that consider SLA violations and node utilisation and that simplify the adaptation decision-making process. Resource management should therefore ideally be solved in an online manner. To address this, approaches have been presented in the literature that set thresholds to trigger VM migration. One challenge with these approaches is that they typically use node metrics (e.g., CPU and memory) as an indicator of VM performance and do not factor in VM performance metrics when setting the CPU migration threshold. Our hypothesis is that migrating VMs without factoring in VM performance metrics (e.g., response time) can lead to either early or delayed migration of VMs. We present an approach to discover the CPU utilisation level for VM migration dynamically. This approach monitors VM response time, discovers the CPU threshold at which response time would increase beyond a defined SLA level, and uses this threshold for VM migration. We use reinforcement learning (RL) to learn when it is rewarding to migrate a VM. The RL reward function drives a policy towards high CPU utilisation and attaches a penalty to overachieving SLAs. We use simulation to evaluate the approach and compare it to four heuristics: Static, Interquartile Range, Median Absolute Deviation, and Local Regression. The results show a significant reduction in SLA violations and an increase in the CPU utilisation of active nodes.
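The idea of deriving the migration threshold from response time rather than from node metrics alone can be illustrated with a much simpler stand-in for the RL policy. The sketch below groups observed (CPU utilisation, response time) samples into utilisation buckets and returns the lower edge of the first bucket whose mean response time exceeds the SLA limit. The function name `discover_cpu_threshold`, the bucket width, and the sample values are all assumptions made for illustration; the paper learns the threshold with reinforcement learning, not with this scan.

```python
from collections import defaultdict
from statistics import mean


def discover_cpu_threshold(samples, sla_response_ms, bucket_width=10):
    """Return the CPU utilisation (percent) at which mean response time
    first exceeds the SLA limit, or 100 if no bucket violates it.

    samples: iterable of (cpu_percent, response_time_ms) observations.
    """
    buckets = defaultdict(list)
    for cpu_percent, response_ms in samples:
        buckets[int(cpu_percent // bucket_width)].append(response_ms)

    # Scan buckets from low to high utilisation and stop at the first violation.
    for index in sorted(buckets):
        if mean(buckets[index]) > sla_response_ms:
            return index * bucket_width
    return 100


# Illustrative observations only; not data from the paper.
observations = [
    (20, 100), (25, 110), (55, 150), (58, 160),
    (72, 180), (85, 320), (88, 400), (95, 900),
]
threshold = discover_cpu_threshold(observations, sla_response_ms=200)
```

For these made-up samples the scan stops at the 80 to 89 percent bucket, so migration would be triggered at 80 percent CPU. A fixed node-level threshold set lower than this would migrate early, and one set higher would migrate late, which is the mismatch the abstract describes.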