In distributed, service-oriented environments, performance problem localization is required to provide self-healing capabilities and deliver the desired quality of service (QoS). This paper presents an automated approach to identifying system elements causing performance problems. Applying probabilistic inference to collected response time and elapsed time data, the approach 1) infers elapsed time for services where data is missing, 2) estimates the response time degradation caused by different services using the duration, abnormality and response time correlation of their elapsed times, and 3) identifies the services that are the most important causes of slow response time and yield the most benefit if recovered. The approach has been used to localize a performance problem on the test bed of a real-world serviceoriented Grid. Evaluation using simulations shows that the approach consistently achieves better accuracy than traditional techniques in various service-oriented settings.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.