2018
DOI: 10.1109/tpds.2017.2773483
Unraveling Network-Induced Memory Contention: Deeper Insights with Machine Learning

Abstract: Remote Direct Memory Access (RDMA) is expected to be an integral communication mechanism for future exascale systems, enabling asynchronous data transfers so that applications may fully utilize CPU resources while simultaneously sharing data amongst remote nodes. In this work we examine Network-induced Memory Contention (NiMC) on InfiniBand networks. We expose the interactions between RDMA, main memory, and cache when applications and out-of-band services compete for memory resources. We then explore…

Cited by 3 publications (2 citation statements)
References 28 publications
“…Preliminary research works addressing anomaly detection and fault prediction using real anomalies have been proposed, but they tend to focus on the availability of a single (HW or SW) component of the system and not on the availability of the entire computing nodes. Ostruckow et al [32] analyze the (specific) failures of GPU processors, Boixaderas et al [23] aim at predicting the memory (DRAM) failures, Di et al [33] detect silent data corruption, Groves et al [34] predict sub-optimal operation due to memory contention. These approaches focus on specific HW components, but in the age of Exascale, with larger systems, more components and higher costs, such partial detection and monitoring systems should be combined and enhanced with supervised and unsupervised holistic anomaly detection models [14], [24].…”
Section: Related Work
confidence: 99%
“…In work by Phothilimthana et al [52], a DPU was used as a write-back cache for a host-based key-value store (KVS), and a 28-60% performance improvement was observed. By offloading portions of a KVS to the DPU, host applications may experience less cache pollution due to network-induced memory contention [25]. Despite the benefits, a challenge of utilizing the DPU as a cache is maintaining a consistent state between the data that resides on the DPU and host memory.…”
Section: Key Value Stores
confidence: 99%
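The consistency challenge raised in the citation statement above can be illustrated with a minimal write-back cache sketch. This is not the implementation from Phothilimthana et al or the cited paper; the class and method names below are hypothetical, and a dictionary stands in for host memory and for DPU-resident cache state. The point is only that writes absorbed by the cache leave the host copy stale until an explicit flush, which is exactly the divergence a DPU-as-cache design must manage.

```python
# Hypothetical sketch: a write-back cache where writes land in the
# "DPU" cache and are flushed to backing "host" memory later, so the
# two copies can transiently disagree.
class WriteBackCache:
    def __init__(self, backing_store):
        self.backing = backing_store   # stands in for host memory
        self.cache = {}                # stands in for DPU memory
        self.dirty = set()             # keys written but not yet flushed

    def put(self, key, value):
        # Write is absorbed by the cache; the host copy is now stale.
        self.cache[key] = value
        self.dirty.add(key)

    def get(self, key):
        # Serve from the cache if present, else fetch from host memory.
        if key in self.cache:
            return self.cache[key]
        value = self.backing[key]
        self.cache[key] = value
        return value

    def flush(self):
        # Propagate dirty entries back to host memory, restoring consistency.
        for key in self.dirty:
            self.backing[key] = self.cache[key]
        self.dirty.clear()

host = {"a": 1}
cache = WriteBackCache(host)
cache.put("a", 2)
stale = host["a"]   # still 1: DPU and host state have diverged
cache.flush()
fresh = host["a"]   # 2: flush reconciles the host copy
```

In a real DPU deployment the flush policy (eager, periodic, or on eviction) determines how long this window of inconsistency stays open, which is the design trade-off the citation statement alludes to.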