2018
DOI: 10.1109/tpds.2017.2773483
Unraveling Network-Induced Memory Contention: Deeper Insights with Machine Learning

Abstract: Remote Direct Memory Access (RDMA) is expected to be an integral communication mechanism for future exascale systems, enabling asynchronous data transfers so that applications may fully utilize CPU resources while simultaneously sharing data amongst remote nodes. In this work we examine Network-induced Memory Contention (NiMC) on InfiniBand networks. We expose the interactions between RDMA, main memory, and cache when applications and out-of-band services compete for memory resources. We then explore…

Cited by 3 publications (2 citation statements)
References 28 publications
“…Preliminary research works addressing anomaly detection and fault prediction using real anomalies have been proposed, but they tend to focus on the availability of a single (HW or SW) component of the system and not on the availability of the entire computing nodes. Ostruckow et al [32] analyze the (specific) failures of GPU processors, Boixaderas et al [23] aim at predicting the memory (DRAM) failures, Di et al [33] detect silent data corruption, Groves et al [34] predict sub-optimal operation due to memory contention. These approaches focus on specific HW components, but in the age of Exascale, with larger systems, more components and higher costs, such partial detection and monitoring systems should be combined and enhanced with supervised and unsupervised holistic anomaly detection models [14], [24].…”
Section: Related Work
confidence: 99%
“…In work by Phothilimthana et al [52], a DPU was used as a write-back cache for a host-based key-value store (KVS), and a 28-60% performance improvement was observed. By offloading portions of a KVS to the DPU, host applications may experience less cache pollution due to network-induced memory contention [25]. Despite the benefits, a challenge of utilizing the DPU as a cache is maintaining a consistent state between the data that resides on the DPU and host memory.…”
Section: Key Value Stores
confidence: 99%
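The consistency challenge raised in the citation statement above can be illustrated with a minimal write-back cache sketch. This is not the implementation from Phothilimthana et al or the cited paper; the class and method names below are hypothetical, and a dictionary stands in for host memory and for DPU-resident cache state. The point is only that writes absorbed by the cache leave the host copy stale until an explicit flush, which is exactly the divergence a DPU-as-cache design must manage.

```python
# Hypothetical sketch: a write-back cache where writes land in the
# "DPU" cache and are flushed to backing "host" memory later, so the
# two copies can transiently disagree.
class WriteBackCache:
    def __init__(self, backing_store):
        self.backing = backing_store   # stands in for host memory
        self.cache = {}                # stands in for DPU memory
        self.dirty = set()             # keys written but not yet flushed

    def put(self, key, value):
        # Write is absorbed by the cache; the host copy is now stale.
        self.cache[key] = value
        self.dirty.add(key)

    def get(self, key):
        # Serve from the cache if present, else fetch from host memory.
        if key in self.cache:
            return self.cache[key]
        value = self.backing[key]
        self.cache[key] = value
        return value

    def flush(self):
        # Propagate dirty entries back to host memory, restoring consistency.
        for key in self.dirty:
            self.backing[key] = self.cache[key]
        self.dirty.clear()

host = {"a": 1}
cache = WriteBackCache(host)
cache.put("a", 2)
stale = host["a"]   # still 1: DPU and host state have diverged
cache.flush()
fresh = host["a"]   # 2: flush reconciles the host copy
```

In a real DPU deployment the flush policy (eager, periodic, or on eviction) determines how long this window of inconsistency stays open, which is the design trade-off the citation statement alludes to.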