Aggregation of Real-Time System Monitoring Data for Analyzing Large-Scale Parallel and Distributed Computing Environments

Bohm, Swen; Engelmann, Christian; Scott, Stephen L.

doi:10.1109/hpcc.2010.32

Cited by 11 publications

(12 citation statements)

References 11 publications

“…It has been used for scalable parallel performance-monitoring and profiling of high-performance computing applications [8,20]. MRNet can pass filtered information up and down the tree, which can be processed at each tree node.…”

Section: Average Local Window Size and Vector Age

mentioning

confidence: 99%

“…This size is slightly larger than the 64 bytes used by Bohm et al [8] for monitoring and the 84 bytes used in MOSIX [9], a decentralized cluster management system that uses process migration for load balancing. Recall that the global information must also be included.…”

Section: Gossip Parameters Using a Single Master

mentioning

confidence: 99%

“…A reasonable local entry size is 100 bytes. This size is slightly larger than the 64 bytes used by Bohm et al [8] for monitoring and the 84 bytes used in MOSIX [9], a decentralized cluster management system that uses process migration for load balancing. MOSIX clusters resemble a single colony without a master, where each entry includes a node's CPU and memory information and status.…”

Section: Gossip Parameters Using a Single Master

mentioning

confidence: 99%

“…This paper presents a new paradigm for providing online information to the management system of scalable clusters, consisting of a large number of nodes and one or more masters that manage these nodes. It has been used for scalable parallel performance-monitoring and profiling of high-performance computing applications [8,20]. The presented algorithms are decentralized, scalable and resilient, working well even when some nodes fail, without needing any recovery protocol.…”

mentioning

confidence: 99%

“…MRNet can pass filtered information up and down the tree, which can be processed at each tree node. It has been used for scalable parallel performance-monitoring and profiling of high-performance computing applications [8,20]. However, MRNet is unsuitable for passing information between nodes of the same tree level, as required by our colony's internal management system, because it requires messages to pass through a common parent, rather than directly. Cluster management systems are known to perform well on moderate-sized Linux clusters, but the management overhead on large-scale supercomputers is not well understood, as these systems are more susceptible to network contentions.…”

mentioning

confidence: 99%

Resilient gossip algorithms for collecting online management information in exascale clusters

Barak, Drezner, Levy, et al. 2015

Concurrency and Computation

Management of forthcoming exascale clusters requires frequent collection of run-time information about the nodes and the running applications. This paper presents a new paradigm for providing online information to the management system of scalable clusters, consisting of a large number of nodes and one or more masters that manage these nodes. We describe the details of resilient gossip algorithms for sharing local information within subsets of nodes and for sending global information to a master, which holds information on all the nodes. The presented algorithms are decentralized, scalable and resilient, working well even when some nodes fail, without needing any recovery protocol. The paper gives formal expressions for approximating the average ages of the local information at each node and the information collected by the master. It then shows that these results closely match the results of simulations and measurements on a real cluster. The paper also investigates the resilience of the algorithms and the impact on the average age when nodes or masters fail. The main outcome of this paper is that partitioning of large clusters can improve the quality of information available to the management system without increasing the number of messages per node.

In the following algorithm, colony nodes share information and push (send) global windows of information to the master:

Push algorithm - colonies send information to the master:

At a fixed point every unit of time, each colony node:
  Updates its vector and immediately sends a local window with all its vector entries whose current age does not exceed T to another node in its colony, chosen randomly with a uniform distribution. When a colony node receives a local window it:
    Adjusts the window for network latency.
    Replaces each vector entry with the received window entry, if the latter is newer.
    Registers the arrival time in the replaced vector entries, using its local clock.
  With probability k/n (k is the intended average update rate), updates its vector and sends a global window to the master. When the master receives a global window it:
    Adjusts the window for network latency.
    Registers the window's arrival time on all the received entries using its local clock.
    Updates all its entries with the latest received window entries, if the latter is newer.

The pull algorithm

In this algorithm, colony nodes share information, while the master regularly pulls (requests) global windows of information from one or a few randomly selected nodes in each colony:

Pull algorithm - master requests information from each colony:

At a fixed point every unit of time, each colony node:
  Updates its vector and immediately sends a local window with all its vector entries whose current age does not exceed T to another node in its colony, chosen randomly with a uniform distribution. When a colony node receives a local window it:
    Adjusts the window for network latency.
    Replaces each vector entry with the received window entry, if the latter is newer.
    Registers the arrival time in the replaced vector entrie...
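The push steps quoted above can be illustrated with a small simulation. This is only a minimal sketch under stated assumptions, not the authors' implementation: all nodes share one simulated clock (so the latency adjustment and the local arrival-time registration are skipped), each vector entry is reduced to the time it was last refreshed, n is taken to be the number of nodes in the colony, and the global window is assumed to contain every entry the sending node currently holds. The function names (`local_window`, `merge`, `simulate_push`) and the default parameter values are hypothetical.

```python
import random


def local_window(vector, now, T):
    """Return the entries whose current age (now - timestamp) does not exceed T."""
    return {node: ts for node, ts in vector.items() if now - ts <= T}


def merge(vector, window):
    """Replace each vector entry with the received window entry if the latter is newer."""
    for node, ts in window.items():
        if ts > vector.get(node, -1):
            vector[node] = ts


def simulate_push(n=16, k=2, T=8, steps=50, seed=0):
    """Simulate one colony of n nodes pushing to a single master.

    Assumptions (not from the source): a shared global clock, zero network
    latency, and a global window that carries the sender's whole vector.
    """
    rng = random.Random(seed)
    vectors = [{i: 0} for i in range(n)]  # each node starts knowing only itself
    master = {}
    for now in range(1, steps + 1):
        outgoing = []
        for i in range(n):
            vectors[i][i] = now  # node refreshes its own entry
            peer = rng.choice([j for j in range(n) if j != i])
            outgoing.append((peer, local_window(vectors[i], now, T)))
            if rng.random() < k / n:  # on average k global windows per time unit
                merge(master, dict(vectors[i]))
        # deliver all local windows after every node has sent (same time unit)
        for peer, window in outgoing:
            merge(vectors[peer], window)
    return vectors, master


def average_age(vector, now):
    """Average age of the entries held in a vector at time `now`."""
    return sum(now - ts for ts in vector.values()) / len(vector)
```

Calling `simulate_push()` and then `average_age(master, 50)` gives a rough feel for how the update rate k and the age limit T affect the freshness of the master's view in this simplified model.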

Scalable and efficient workload hotspot detection in virtualized environment

Zhou, Guo, et al. 2014

Cluster Comput

Adaptive, scalable and reliable monitoring of big data on clouds

Andreolini, Colajanni, Pietri, et al. 2015

Journal of Parallel and Distributed Computing

Real-time monitoring of cloud resources is crucial for a variety of tasks such as performance analysis, workload management, capacity planning and fault detection. Applications producing big data make the monitoring task very difficult at high sampling frequencies because of high computational and communication overheads in collecting, storing, and managing information. We present an adaptive algorithm for monitoring big data applications that adapts the intervals of sampling and frequency of updates to data characteristics and administrator needs. Adaptivity allows us to limit computational and communication costs and to guarantee high reliability in capturing relevant load changes. Experimental evaluations performed on a large testbed show the ability of the proposed adaptive algorithm to reduce resource utilization and communication overhead of big data monitoring without penalizing the quality of data, and demonstrate our improvements to the state of the art.