Disparity: Scalable Anomaly Detection for Clusters

Desai, Narayan; Bradshaw, Rick; Lusk, Ewing L.

doi:10.1109/icpp-w.2008.30

Cited by 2 publications

(2 citation statements)

References 6 publications

“…The second part of PHM is a core infrastructure that launches the diagnostics, aggregates the results, generates/updates snapshots, and will ultimately provide control to determine what diagnostic to launch next. Part of this infrastructure (the diagnostic launching, snapshot manipulation, and control logic) will run on a head node, and part of this infrastructure (data aggregation) is distributed across the system and uses the system itself to efficiently compute statistics, a la Disparity [7]. An important goal is to make it as easy as possible for users of PHM to develop their own diagnostics without needing to reinvent such an infrastructure.…”

Section: Approach (mentioning)

confidence: 99%

“…However, PHM is the first to utilize a global view of system performance rather than an aggregation of local views. To elaborate, existing tools for monitoring system performance, including Disparity [7], NWPerf [15], Supermon [24], ClusterProbe [12], and Performance Co-Pilot [22], are all designed for continuous monitoring of system performance. Consequently, measurement overhead is a serious concern.…”

Section: Node-local Performance Monitoring (mentioning)

confidence: 99%

Performance Health Monitoring of Large-Scale Systems

Rajamony

2014

As largest-scale computing systems currently progress to multi-petaflop systems and beyond, achieving their full potential requires increasing attention to the performance health of the systems overall, wherein degraded performance of a single subsystem such as a node, NIC, memory module, kernel process, etc. can effectively degrade the performance of an entire system running a large application. The current state of the art in identifying and remediating sources of performance loss is as much an art as a science, typically requiring labor- and expertise-intensive human resources operating in an ad-hoc and experience-based fashion. In this work we motivate the need for a new type of performance analysis tool, a Performance Health Monitor (PHM), that will efficiently pinpoint sources of lost performance on the largest systems and enable applications to experience a consistent performance environment from run to run. PHM is aimed at providing a global view of system performance in contrast to an aggregation of local views as well as to explain system performance issues and suggest corrective actions or pinpoint likely causes. A spectrum of usage modes will be supported ranging from quick, partial-system, user-initiated checks to comprehensive, full-system evaluations. Such evaluations may be archived as performance snapshots for comparing distinct systems, system subsets, or the same system pre- and post-upgrade.

Meta-monitoring system for ensuring a fault tolerance of the intelligent high-performance computing environment

Sidorov, Sidorova, Kurzibova

2019

The International Workshop on Information, Computation, and Control Systems for Distributed Environments 2019

High-performance computing systems include a large number of hardware and software components that can cause failures. Well-known approaches to monitoring and ensuring the fault tolerance of high-performance computing systems do not currently provide a fully integrated solution. The aim of this paper is to develop methods and tools for identifying abnormal situations during large-scale computational experiments in high-performance computing environments, localizing these malfunctions, troubleshooting them automatically where possible, and automatically reconfiguring the computing environment otherwise. The proposed approach is based on the idea of integrating the monitoring systems used in different nodes of the environment into a unified meta-monitoring system. The approach minimizes the time needed for diagnostics and troubleshooting through the use of parallel operations. It also improves the resiliency of the computing environment's processes through preventive measures for diagnosing and troubleshooting failures. These advantages increase the reliability and efficiency of the environment. The novelty of the proposed approach lies in the following elements: mechanisms for the decentralized collection, storage, and processing of monitoring data; a new decision-making technique for reconfiguring the environment; and support for fault tolerance and reliability not only for software and hardware, but also for environment management systems.
