Leveraging Many Simple Statistical Models to Adaptively Monitor Software Systems

Munawar, Mohammad A.; Ward, Paul A. S.

doi:10.1007/978-3-540-74742-0_42

Cited by 16 publications

(23 citation statements)

References 10 publications


“…Figure 4 illustrates the adaptation of sampling rate according to the resource utilization. The concrete model of the monitoring adaptation is also to be improved and simple statistical models are intended to be experimented first [19]. In a similar way, the same scenario can be applied to others Grid middleware components that tend to be overloaded.…”

Section: Wms Overload (mentioning)

confidence: 99%

Issues and scenarios for self-managing grid middleware

Collet

Křikava

Montagnat

et al. 2010

Proceedings of the 2nd Workshop on Grids Meets Autonomic Computing


Despite significant efforts to achieve reliable grid middlewares, grid infrastructures still encounter important difficulties to implement the promise of ubiquitous, seamless and transparent computing. Identified causes are numerous, such as the complexity of middleware stacks, dependence to many distributed resources, heterogeneity of hardware and software operated or incompatibilities between software components declared as interoperable. Based on failures that occurred during a large data challenge run on a grid dedicated to neuroscience, we identify scenarios that can be handled through autonomic management associated to the grid middleware. We also outline a flexible self-adaptive framework that aims at using model-driven development to facilitate the engineering, integration and reuse of MAPE-K loops in large scale distributed systems.


“…Handling of temporary faults has been done extensively (Munawar and Ward, 2011). This paper handles permanent faults that occur in a processor chip's data path.…”

Section: Introduction (mentioning)

confidence: 99%

On the field design bug tolerance on a multi-core processor using FPGA

Sriraman

Pattabiraman

2017

IJHPCN


Abstract: In recent times, with increased transistor density, it is impossible to verify all the components exhaustively for different scenarios. This results in design bugs also known as extrinsic hardware faults to escape into the processor chip in spite of various levels of testing. Hence, handling design bugs efficiently on the field is a necessity in modern multi-core processors. In this paper, an architecture and algorithm for self-repairing of design bugs in the data path using FPGA is proposed. The FPGA is re-configured during the run-time to take over the functions of the faulty component. To verify the effectiveness of the proposed design a representative sample of five faults are injected and handled. The proposed design's area overhead and time overhead calculations are done using Cadence ncverilog and gem5 simulator respectively. The area overhead of the proposed design is < 1% and performance improvement is around 2.5% compared to the existing techniques.


“…We have described our invariant-identification and error detection approach based on simple linear regression in previous work [12,13]. Here we extend it to clustered systems, taking care to avoid identifying accidental correlations as invariants.…”

Section: Error Verification (mentioning)

confidence: 99%

“…Agarwal et al [25] also describe an approach to create fault signatures based on correlation between change-points in different metrics. Our prior work [12] is the first to demonstrate automated adaptive monitoring, and focuses on achieving the benefits of continuous monitoring at a fraction of the cost. The current work augments our earlier approach by diagnosing faulty components using more-precise trace data instead of metric-based invariants.…”

Section: Related Work (mentioning)

confidence: 99%

“…The error verification step aims to limit the monitoring cost that arises because of false alarms, while providing a robust means for validating the existence of an error. Our verification step entails collecting a larger set of system metrics, among which stable, long-term correlations exist [10,11,12,13]. These correlations, also known as invariants, are captured a priori in the form of regression models using data collected from a healthy system.…”

Section: Error Verification (mentioning)

confidence: 99%
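
The citation statement above describes invariants as stable correlations between metrics, captured ahead of time as regression models fitted on data from a healthy system. A minimal sketch of that idea, assuming a single pair of metrics and ordinary least squares, might look like the following. The function names (`fit_invariant`, `holds`), the `slack` factor, and the sample numbers are illustrative assumptions and are not taken from the cited papers.

```python
from statistics import mean


def fit_invariant(x, y, slack=2.0):
    """Fit y ~ a + b*x by ordinary least squares on healthy-system samples.

    Returns (a, b, tolerance). The tolerance rule (slack times the largest
    residual seen on healthy data) is an assumption made for this sketch.
    """
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    if sxx == 0:
        raise ValueError("x has no variance; no linear model can be fitted")
    b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
    a = my - b * mx
    max_residual = max(abs(yi - (a + b * xi)) for xi, yi in zip(x, y))
    return a, b, slack * max_residual


def holds(model, x_new, y_new):
    """Return True if a new observation is consistent with the invariant."""
    a, b, tolerance = model
    return abs(y_new - (a + b * x_new)) <= tolerance


# Hypothetical healthy-system samples: request rate vs. CPU utilisation.
healthy_x = [10, 20, 30, 40, 50]
healthy_y = [21, 39, 62, 80, 99]
model = fit_invariant(healthy_x, healthy_y)

print(holds(model, 30, 61))  # observation close to the fitted line
print(holds(model, 30, 20))  # observation far from the fitted line
```

In a verification step of the kind described, many such pairwise models would be checked together, and a broken invariant would count as evidence that a reported error is real rather than a false alarm.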


Adaptive Monitoring with Dynamic Differential Tracing-Based Diagnosis

Munawar

Reidemeister

Jiang

et al. 2008

Managing Large-Scale Service Deployment

Self Cite


Abstract. Ensuring high availability, adequate performance, and proper operation of enterprise software systems requires continuous monitoring. Today, most systems operate with minimal monitoring, typically based on service-level objectives (SLOs). Detailed metric-based monitoring is often too costly to use in production, while tracing is prohibitively expensive. Configuring monitoring when problems occur is a manual process. In this paper we propose an alternative: Minimal monitoring with SLOs is used to detect errors. When an error is detected, detailed monitoring is automatically enabled to validate errors using invariant-correlation models. If validated, Application-Response-Measurement (ARM) tracing is dynamically activated on the faulty subsystem and a healthy peer to perform differential trace-data analysis and diagnosis. Based on fault-injection experiments, we show that our system is effective; it correctly detected and validated errors caused by 14 out of 15 injected faults. Differential analysis of the trace data collected for 210 seconds allowed us to top-rank the faulty component in 80% of the cases. In the remaining cases the faulty component was ranked within the top-7 out of 81 components. We also demonstrate that the overhead of our system is low; given a false positive rate of one per hour, the overhead is less than 2.5%.
