Bottleneck Detection Using Statistical Intervention Analysis

Malkowski, Simon; Hedwig, Markus; Parekh, Jason; Pu, Calton; Sahai, Akhil

doi:10.1007/978-3-540-75694-1_11

Cited by 19 publications

(7 citation statements)

References 7 publications


“…These control limits Statistical Intervention Analysis Bottleneck identification. [43] describe the range of expected variability in the data over a period of time. When new observations fall outside the set control limits they are detected as anomalies and their cause(s) must be identified and corrected [87].…”

Section: Statistical Process Control (mentioning)

confidence: 99%

Performance Anomaly Detection and Bottleneck Identification

2015


In order to meet stringent performance requirements, system administrators must effectively detect undesirable performance behaviours, identify potential root causes and take adequate corrective measures. The problem of uncovering and understanding performance anomalies and their causes (bottlenecks) in different system and application domains is well studied. In order to assess progress, research trends and identify open challenges, we have reviewed major contributions in the area and present our findings in this survey. Our approach provides an overview of anomaly detection and bottleneck identification research as it relates to the performance of computing systems. By identifying fundamental elements of the problem, we are able to categorize existing solutions based on multiple factors such as the detection goals, nature of applications and systems, system observability, and detection methods.
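The control-limit idea in the citation statement above can be illustrated with a minimal sketch. It assumes the common "mean plus or minus three standard deviations" convention, which the quoted text does not specify; the function names `control_limits` and `find_anomalies` and the sample values are hypothetical.

```python
from statistics import mean, pstdev


def control_limits(baseline, k=3.0):
    """Return (lower, upper) limits as mean +/- k standard deviations.

    Assumption: k = 3 is a common convention, not taken from the source.
    """
    center = mean(baseline)
    spread = pstdev(baseline)
    return center - k * spread, center + k * spread


def find_anomalies(observations, limits):
    """Return (index, value) pairs for observations outside the limits."""
    lower, upper = limits
    return [(i, x) for i, x in enumerate(observations) if x < lower or x > upper]


# Hypothetical response-time samples (ms) from a period of normal operation.
baseline = [100, 102, 98, 101, 99, 100, 103, 97]
limits = control_limits(baseline)

# New observations are checked against the limits set from the baseline.
new_obs = [101, 99, 120, 100, 85]
anomalies = find_anomalies(new_obs, limits)
print(limits, anomalies)
```

Flagged points only indicate that something changed; identifying the cause remains a separate step, as the quoted statement notes.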


“…In order to approximate the exact workload that can saturate the critical hardware resource, the procedure uses a statistical intervention analysis [11] on the SLO-satisfaction of a system. The main idea of such analysis is to evaluate the stability of the SLO-satisfaction of the system as workload increases; the SLO-satisfaction should be nearly constant under low workload and start to deteriorate significantly once the workload saturates the critical hardware resource.…”

Section: Inferring a Good Allocation of Local Soft Resources (mentioning)

confidence: 99%

“…The main idea of such analysis is to evaluate the stability of the SLO-satisfaction of the system as workload increases; the SLO-satisfaction should be nearly constant under low workload and start to deteriorate significantly once the workload saturates the critical hardware resource. Readers who are interested in more details can refer to our previous paper [11].…”

Section: Inferring a Good Allocation of Local Soft Resources (mentioning)

confidence: 99%

The Impact of Soft Resource Allocation on n-Tier Application Scalability

Wang, Malkowski, Kanemasa, et al. 2011

2011 IEEE International Parallel & Distributed Processing Symposium

Self Cite


Good performance and efficiency, in terms of high quality of service and resource utilization for example, are important goals in a cloud environment. Through extensive measurements of an n-tier application benchmark (RUBBoS), we show that overall system performance is surprisingly sensitive to appropriate allocation of soft resources (e.g., server thread pool size). Inappropriate soft resource allocation can quickly degrade overall application performance significantly. Concretely, both under-allocation and over-allocation of thread pool can lead to bottlenecks in other resources because of non-trivial dependencies. We have observed some non-obvious phenomena due to these correlated bottlenecks. For instance, the number of threads in the Apache web server can limit the total useful throughput, causing the CPU utilization of the C-JDBC clustering middleware to decrease as the workload increases. We provide a practical iterative solution approach to this challenge through an algorithmic combination of operational queuing laws and measurement data. Our results show that soft resource allocation plays a central role in the performance scalability of complex systems such as n-tier applications in cloud environments.
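The citation statements for this paper describe evaluating the stability of SLO-satisfaction as workload increases. The sketch below is one simplified way to express that idea and is not the procedure from either paper: it assumes the first few workload levels are unsaturated, derives a tolerance band from them, and reports the first workload from which SLO-satisfaction stays below that band. The function name, parameters, and data are hypothetical.

```python
from statistics import mean, stdev


def first_saturated_workload(workloads, slo_satisfaction,
                             n_baseline=3, k=3.0, min_tolerance=0.01):
    """Return the first workload at which SLO-satisfaction drops below the
    baseline band and stays below it, or None if it never does.

    Assumptions (not from the source): the first n_baseline points are
    treated as the stable low-workload regime, and the band is
    mean - max(k * stdev, min_tolerance).
    """
    base = slo_satisfaction[:n_baseline]
    threshold = mean(base) - max(k * stdev(base), min_tolerance)
    for i in range(n_baseline, len(workloads)):
        # Require every later point to stay below the band, so a single
        # noisy dip is not mistaken for saturation.
        if all(s < threshold for s in slo_satisfaction[i:]):
            return workloads[i]
    return None


# Hypothetical measurements: number of users and fraction of requests within the SLO.
workloads = [1000, 2000, 3000, 4000, 5000, 6000, 7000]
slo = [0.99, 0.98, 0.99, 0.98, 0.98, 0.80, 0.55]
saturation = first_saturated_workload(workloads, slo)
print(saturation)
```

Requiring the drop to persist is a design choice made here for robustness against noise; a formal intervention analysis would instead test for a statistically significant change in the series.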


“…In practice we use a simple statistical intervention analysis [13] to approximate N, where the main idea of this analysis is to find the minimum load (N) beyond which the increments of throughput becomes negligible with further increment of load. Suppose the load in a server varies between [N_min, N_max]; then we divide [N_min, N_max] into k even intervals (e.g., k = 100) and calculate the average throughput in each load interval based on the load/throughput samples we collected during the experimental period.…”

Section: Congestion Point N Determination (mentioning)

confidence: 99%

Detecting Transient Bottlenecks in n-Tier Applications through Fine-Grained Analysis

Wang, Kanemasa, et al. 2013

2013 IEEE 33rd International Conference on Distributed Computing Systems

Self Cite


Identifying the location of performance bottlenecks is a non-trivial challenge when scaling n-tier applications in computing clouds. Specifically, we observed that an n-tier application may experience significant performance loss when there are transient bottlenecks in component servers. Such transient bottlenecks arise frequently at high resource utilization and often result from transient events (e.g., JVM garbage collection) in an n-tier system and bursty workloads. Because of their short lifespan (e.g., milliseconds), these transient bottlenecks are difficult to detect using current system monitoring tools with sampling at intervals of seconds or minutes. We describe a novel transient bottleneck detection method that correlates throughput (i.e., request service rate) and load (i.e., number of concurrent requests) of each server in an n-tier system at fine time granularity. Both throughput and load can be measured through passive network tracing at millisecond-level time granularity. Using correlation analysis, we can identify the transient bottlenecks at time granularities as short as 50ms. We validate our method experimentally through two case studies on transient bottlenecks caused by factors at the system software layer (e.g., JVM garbage collection) and architecture layer (e.g., Intel SpeedStep).
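The congestion-point statement quoted for this paper describes dividing the load range [N_min, N_max] into k even intervals and averaging throughput per interval. A minimal sketch of that binning follows. The final step, which picks the lowest load after which every further throughput gain is below a fraction of the peak, is an assumed stand-in for the paper's intervention analysis, and all names and data are hypothetical.

```python
def average_throughput_by_load(samples, k=100):
    """Bin (load, throughput) samples into k even load intervals.

    Returns (interval_center, average_throughput) for each non-empty interval.
    """
    loads = [load for load, _ in samples]
    n_min, n_max = min(loads), max(loads)
    if n_max == n_min:
        return [(float(n_min), sum(t for _, t in samples) / len(samples))]
    width = (n_max - n_min) / k
    sums, counts = [0.0] * k, [0] * k
    for load, tput in samples:
        idx = min(int((load - n_min) / width), k - 1)  # N_max goes in the last bin
        sums[idx] += tput
        counts[idx] += 1
    return [(n_min + (i + 0.5) * width, sums[i] / counts[i])
            for i in range(k) if counts[i]]


def congestion_point(binned, rel_gain=0.02):
    """Return the lowest binned load after which all throughput increments
    are at most rel_gain * peak throughput (an assumed negligibility rule)."""
    peak = max(t for _, t in binned)
    for i in range(len(binned) - 1):
        later_gains = (binned[j + 1][1] - binned[j][1]
                       for j in range(i, len(binned) - 1))
        if all(gain <= rel_gain * peak for gain in later_gains):
            return binned[i][0]
    return binned[-1][0]


# Synthetic samples: throughput grows with load, then flattens at load 10.
samples = [(load, min(load, 10) * 100) for load in range(1, 21)]
binned = average_throughput_by_load(samples, k=10)
n_star = congestion_point(binned)
print(n_star)
```

The result is the center of a load interval, so its resolution depends on k; a larger k gives a finer estimate but needs enough samples per interval for the averages to be meaningful.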
