Revealing anomalies at the operating system (OS) level to support online diagnosis of complex software systems is a promising approach when traditional detection mechanisms (e.g., based on event logs, probes, and heartbeats) are inadequate or cannot be applied. In this paper we propose a configurable detection framework to reveal anomalies in OS behavior that are related to system misbehaviors. The detector is based on online statistical analysis techniques and is designed for systems that operate under variable and non-stationary conditions. The framework is evaluated on detecting the activation of software faults in a complex distributed system for Air Traffic Management (ATM). Results of experiments with two different OSs, namely Linux Red Hat EL5 and Windows Server 2008, show that the detector is effective for mission-critical systems. The framework can be configured to select the monitored indicators so as to tune the level of intrusiveness. A sensitivity analysis of the detector parameters is carried out to show their impact on performance and to give practitioners guidelines for field tuning.
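To make the idea of online statistical detection under non-stationary conditions concrete, here is a minimal sketch. It is not the framework described in the abstract: the class name `EwmaDetector`, the parameters `alpha`, `k`, and `warmup`, and the choice of an exponentially weighted moving average with a k-sigma band are all assumptions made for illustration. The detector tracks one OS-level indicator and adapts its baseline as the workload drifts.

```python
class EwmaDetector:
    """Hypothetical k-sigma detector over an exponentially weighted baseline."""

    def __init__(self, alpha=0.1, k=3.0, warmup=5):
        self.alpha = alpha    # smoothing factor: higher adapts faster
        self.k = k            # width of the tolerance band, in standard deviations
        self.warmup = warmup  # number of initial samples that never raise an alarm
        self.mean = None
        self.var = 0.0
        self.n = 0

    def update(self, x):
        """Return True if x falls outside the current band, then adapt the baseline."""
        self.n += 1
        if self.mean is None:
            self.mean = float(x)
            return False
        deviation = x - self.mean
        std = self.var ** 0.5
        anomalous = self.n > self.warmup and abs(deviation) > self.k * std
        # Exponentially weighted updates of the running mean and variance.
        self.mean += self.alpha * deviation
        self.var = (1 - self.alpha) * (self.var + self.alpha * deviation ** 2)
        return anomalous


detector = EwmaDetector()
normal_flags = [detector.update(v) for v in [10, 11, 9, 10, 11, 9, 10, 11, 9, 10]]
spike_flag = detector.update(100)
```

In this sketch, `alpha` and `k` play the role of the tunable parameters that a sensitivity analysis would explore: a larger `k` lowers the false alarm rate at the cost of missed detections.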
The phenomenon of software aging is increasingly recognized as a relevant problem of long-running systems. Numerous experiments have been carried out in the last decade to empirically analyze software aging. Such experiments, besides highlighting the relevance of the phenomenon, have shown that aging is tightly related to the applied workload. However, due to the differences among the applications studied and among the experimental conditions, results of past studies are not comparable to each other. This prevents drawing general conclusions (e.g., about the aging-workload relationship) and comparing systems from the aging perspective. In this paper, we propose a procedure to carry out aging experiments in different applications for: i) assessing the aging trend of the individual systems, as well as assessing differences among them (i.e., obtaining comparable results), and ii) inferring workload-aging relationships from experiments performed on different applications, by highlighting the most relevant workload parameters. The procedure is applied, through a set of long-running experiments, to three real-scale software applications, namely Apache Web Server, James Mail Server, and CARDAMOM, a middleware for the development of air traffic control (ATC) systems.
Software rejuvenation has been addressed in hundreds of papers since it was proposed in 1995 by Huang et al. The growing number of research papers shows the great importance of this topic. However, no paper has yet studied software rejuvenation in the real world. This paper investigates to what extent software rejuvenation techniques are integrated in IT and Telco solutions. For this purpose, an intensive search was conducted across different sources such as company product websites, technical papers, white papers, US patents, and consultant surveys. The results show that IT and Telco companies develop software rejuvenation solutions to deal with software aging. The number of US patents addressing this issue confirms the interest of industry in developing mechanisms to deal with software aging-related failures. It has been observed that real software rejuvenation solutions mainly use time-based or threshold-based policies, while the US patents focus on predictive approaches.
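The difference between the two policy families named above can be sketched in a few lines. The function names and parameters below are hypothetical and are not taken from any product or patent covered by the survey; they only illustrate the trigger condition each family relies on.

```python
def time_based_due(uptime_s: float, interval_s: float) -> bool:
    """Time-based policy: rejuvenate once a fixed interval has elapsed."""
    return uptime_s >= interval_s


def threshold_based_due(resource_usage: float, threshold: float) -> bool:
    """Threshold-based policy: rejuvenate once a monitored resource crosses a limit."""
    return resource_usage >= threshold
```

A predictive approach would instead estimate when the limit will be reached and schedule the restart ahead of that moment.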
This study investigates software aging effects caused by the activation of concurrency bugs in a well-known database management system (DBMS), namely MySQL. Experiments with different workloads are performed in order to reproduce the most likely conditions for concurrency bug activation. Besides the typical aging effects observed in many operational systems (i.e., a gradual degradation over time), results highlight that both available resources and DBMS performance (e.g., service rate, service time, and connection latency) can decrease with time in a hard-to-predict way. We observed that, due to the activation of a concurrency bug, the DBMS enters a degraded state in which: i) the estimation of Time-To-Failure (TTF) by means of memory depletion trend analysis is highly inaccurate, and ii) the failure rate does not depend on the instantaneous and/or mean accumulated work. Results suggest that, in such cases, finer-grained indicators and/or different techniques need to be taken into account for properly preventing failures.
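For readers unfamiliar with TTF estimation by memory depletion trend analysis, the following is a minimal sketch of one common form of it: fit a least-squares line to free-memory samples and extrapolate to exhaustion. The function name `estimate_ttf`, the `floor` parameter, and the linear model are assumptions for illustration, not the procedure used in the study.

```python
def estimate_ttf(times, free_mem, floor=0.0):
    """Extrapolate a least-squares line through free-memory samples.

    Returns the estimated time remaining after the last sample until free
    memory reaches `floor`, or None if the fitted trend is not decreasing.
    """
    n = len(times)
    if n < 2 or n != len(free_mem):
        raise ValueError("need at least two paired samples")
    mean_t = sum(times) / n
    mean_m = sum(free_mem) / n
    sxx = sum((t - mean_t) ** 2 for t in times)
    if sxx == 0:
        raise ValueError("all samples share the same timestamp")
    sxy = sum((t - mean_t) * (m - mean_m) for t, m in zip(times, free_mem))
    slope = sxy / sxx
    intercept = mean_m - slope * mean_t
    if slope >= 0:
        return None  # no depletion trend to extrapolate
    t_exhaust = (floor - intercept) / slope
    return max(0.0, t_exhaust - times[-1])


# Free memory drops by 10 units per time step, starting from 100.
ttf = estimate_ttf([0, 1, 2, 3], [100, 90, 80, 70])
```

The estimate is only as good as the assumption that depletion is steady. When degradation is abrupt or irregular, as the abstract reports for concurrency bugs, a straight-line extrapolation of this kind can be far off.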