Fault tolerance in cloud computing environment: A systematic survey

Hasan, Moin; Goraya, Major Singh

doi:10.1016/j.compind.2018.03.027

Cited by 82 publications

(40 citation statements)

References 53 publications


“…The problem of fault tolerance is more challenging in cloud-based systems since the cloud computing architecture is highly complex and dynamically growing [30], [31]. According to a recent survey [32], among the traditional fault tolerance mechanisms (e.g., [6], [29], [33]), Checkpoint and Restart (CR) [6] is commonly used for implementing fault tolerance in the Cloud. The CR mechanism is normally utilized in order to restart applications in case of failures (e.g., [34], [35]), while it is also used for migrating tasks and applications from one node of the cloud to another (e.g., [36]) when special circumstances impose it (e.g., node unavailability).…”

Section: Prior Work (mentioning)

confidence: 99%

Optimum Interval for Application-level Checkpoints

Siavvas

Gelenbe

2019

2019 6th IEEE International Conference on Cyber Security and Cloud Computing (CSCloud)/ 2019 5th IEEE International Conference


Checkpointing is commonly adopted for enhancing the performance of software applications that operate in the presence of failures. Among the existing checkpointing strategies, Application-level Checkpoint and Restart (ALCR) is considered the most efficient, since it leaves a smaller memory footprint, but it requires significant development effort. Although existing ALCR tools and libraries manage to reduce the effort required for implementing the checkpoints, they do not provide recommendations regarding the inter-checkpoint interval. To this end, in the present paper, we develop a mathematical model to estimate the optimum checkpoint interval, i.e., the interval between two successive checkpoints that minimises the average execution time of the application. The case of programs with loops and nested loops is also discussed. The results are illustrated with several numerical examples.
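The abstract does not state the paper's model, so the following is not the authors' formula. As a rough illustration of what an "optimum checkpoint interval" trades off, the sketch below uses Young's classical first-order approximation, T ≈ sqrt(2·C·M), where C is the time needed to write one checkpoint and M is the mean time between failures. The function name and the example numbers are assumptions made for illustration only.

```python
import math


def young_interval(checkpoint_cost: float, mtbf: float) -> float:
    """Young's first-order approximation of the optimum checkpoint interval.

    checkpoint_cost: time to write one checkpoint (same unit as mtbf)
    mtbf: mean time between failures
    NOTE: this is a textbook approximation, not the model from the paper.
    """
    if checkpoint_cost <= 0 or mtbf <= 0:
        raise ValueError("checkpoint_cost and mtbf must be positive")
    return math.sqrt(2.0 * checkpoint_cost * mtbf)


# Hypothetical numbers: 50 s per checkpoint, one failure every 10,000 s on average.
interval = young_interval(50.0, 10_000.0)  # 1000.0 seconds
```

Checkpointing more often than this wastes time writing checkpoints, while checkpointing less often increases the work lost per failure. The approximation is only reasonable when the checkpoint cost is small relative to the mean time between failures.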



“…In cloud, a lot of research work is done in the field of energy efficiency [3][4][5], fault tolerance [6], reputation [17], load balancing [7], and decision making systems for service selection [8][9] etc., but there is a lack of research towards the CGR selection problem.…”

Section: A Motivation and Contribution (mentioning)

confidence: 99%

“…Step 5: Determine the leaving and entering outranking flows. The leaving outranking flow is defined by Eq. (5). The entering outranking flow is defined by Eq. (6). Step 6: Calculate the outranking flow for each CSP by Eq. (7).…”

Section: (4) (mentioning)

confidence: 99%
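The equations referenced in Steps 5 and 6 of the quoted statement did not survive on this page. As a hedged illustration, the sketch below uses the textbook PROMETHEE II definitions: the leaving flow of an alternative is the average preference it has over all others, the entering flow is the average preference all others have over it, and the net flow is their difference. The function name, the `pi` matrix layout, and the sample values are assumptions; the citing paper may use a different notation or normalisation.

```python
def outranking_flows(pi):
    """Textbook PROMETHEE II flows from an aggregated preference matrix.

    pi[a][b] is the aggregated preference of alternative a over b, in [0, 1].
    Diagonal entries are ignored. Returns (leaving, entering, net) lists.
    """
    n = len(pi)
    if n < 2:
        raise ValueError("at least two alternatives are required")
    leaving = [sum(pi[a][b] for b in range(n) if b != a) / (n - 1) for a in range(n)]
    entering = [sum(pi[b][a] for b in range(n) if b != a) / (n - 1) for a in range(n)]
    net = [out - inc for out, inc in zip(leaving, entering)]
    return leaving, entering, net


# Hypothetical preference matrix for three providers (values are made up).
pi = [
    [0.0, 0.6, 0.8],
    [0.2, 0.0, 0.5],
    [0.1, 0.3, 0.0],
]
leaving, entering, net = outranking_flows(pi)
# Rank alternatives by descending net flow (index 0 is the best here).
ranking = sorted(range(len(net)), key=lambda a: net[a], reverse=True)
```

With these definitions the net flows always sum to zero, which is a quick sanity check on any implementation.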

A Ranking Based Model for Selecting Optimum Cloud Geographical Region

Neeraj

Singh

Singh

2019

IJITEE


Cloud computing has become a dominant service computing model in which services such as software, platform, and infrastructure are provided to cloud consumers on demand under a pay-as-you-go model. Every cloud consumer requires high-performance service at minimal cost. The performance of a service in the cloud is measured by parameters such as availability, agility, cost, and security. The service performance and cost depend heavily on the cloud geographical region (CGR) where the services are deployed. The services offered by a cloud service provider are installed in multiple data centers located in different CGRs. In a cloud environment, selecting a service installed in the optimum CGR within a limited time overhead is a challenging and interesting problem. In this paper, a ranking model based on the multi-criteria decision-making method PROMETHEE II and the objective weighting method Shannon’s Entropy is proposed for solving the optimum CGR selection problem. The CGR dataset of Amazon Web Service is used for the numerical analysis. Sensitivity analysis is performed to validate the stability of the proposed model and to identify the most sensitive parameter. The applicability and usefulness of the service selection process are validated through experimental results on a synthetic dataset. Results show that the service selection process is achieved with limited time overhead and is hence suitable for online selection in the cloud.
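The abstract names Shannon's Entropy as the objective weighting method but gives no formulas. The sketch below shows the commonly used entropy weighting procedure, assuming a decision matrix with one row per alternative and one column per criterion, all entries positive: each column is normalised to proportions, its entropy is computed, and criteria with lower entropy (more variation across alternatives) receive higher weight. The function name and sample data are hypothetical and are not taken from the paper.

```python
import math


def entropy_weights(matrix):
    """Objective criterion weights via the standard Shannon entropy method.

    matrix: list of rows (alternatives), each a list of positive criterion values.
    Returns one weight per criterion; the weights sum to 1.
    """
    m = len(matrix)
    if m < 2:
        raise ValueError("at least two alternatives are required")
    n = len(matrix[0])
    diversities = []
    for j in range(n):
        column = [row[j] for row in matrix]
        if any(x <= 0 for x in column):
            raise ValueError("all entries must be positive")
        total = sum(column)
        probs = [x / total for x in column]
        # Entropy normalised to [0, 1] by dividing by ln(m).
        entropy = -sum(p * math.log(p) for p in probs) / math.log(m)
        # Clamp tiny negative values caused by floating-point rounding.
        diversities.append(max(0.0, 1.0 - entropy))
    total_diversity = sum(diversities)
    if total_diversity == 0.0:
        # No criterion discriminates between alternatives: fall back to equal weights.
        return [1.0 / n] * n
    return [d / total_diversity for d in diversities]


# Hypothetical data: criterion 0 is identical for every region, criterion 1 varies.
weights = entropy_weights([[1.0, 10.0], [1.0, 20.0], [1.0, 70.0]])
```

In this example nearly all of the weight goes to the second criterion, because the first one does not distinguish between the alternatives.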


“…The tendency of software to fail or cause a system failure after running continuously for a specific time period is referred to as software aging [6,7]. Software aging is a phenomenon in long-run software systems that causes an increased failure rate and/or degraded performance due to accumulation of aging errors [8,9].…”

Section: Introduction (mentioning)

confidence: 99%

“…Since VM requests in IaaS clouds are usually different in terms of software and tools for which they are initiated, their corresponding VMs exhibit complex behaviors and sophisticated interactions throughout their lifetime that enable VMMs to manage a wide variety of VM behaviors. Consequently, after running for a long time or managing heavy workloads, VMMs, like any other software, age and slow due to a multitude of internal errors and diverse behaviors of VMs [7,10-13].…”

Section: Introduction (mentioning)

confidence: 99%

Modeling and Evaluation of Power-Aware Software Rejuvenation in Cloud Systems

2018


Long and continuous running of software can cause software aging-induced errors and failures. Cloud data centers suffer from these kinds of failures when Virtual Machine Monitors (VMMs), which control the execution of Virtual Machines (VMs), age. Software rejuvenation is a proactive fault management technique that can prevent the occurrence of future failures by terminating VMMs, cleaning up their internal states, and restarting them. However, the appropriate time and type of VMM rejuvenation can affect performance, availability, and power consumption of a system. In this paper, an analytical model is proposed based on Stochastic Activity Networks for performance evaluation of Infrastructure-as-a-Service cloud systems. Using the proposed model, a two-threshold power-aware software rejuvenation scheme is presented. Many details of real cloud systems, such as VM multiplexing, migration of VMs between VMMs, VM heterogeneity, failure of VMMs, failure of VM migration, and different probabilities for arrival of different VM request types are investigated using the proposed model. The performance of the proposed rejuvenation scheme is compared with two baselines based on diverse performance, availability, and power consumption measures defined on the system.
