2016 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW) 2016
DOI: 10.1109/ipdpsw.2016.188

Large-Scale Persistent Numerical Data Source Monitoring System Experiences

Cited by 5 publications
(4 citation statements)
References 4 publications
“…Accounts of long-term production experiences, however, are currently still rare and mostly cover ODAV deployments. Brandt et al [10] discuss their 2-year LDMS deployment on Blue Waters, a large-scale HPC system with more than 27,000 nodes. The authors describe a series of technical challenges related to reliability, overhead and data consistency, as well as dive into system-specific issues, such as clock skew effects.…”
Section: State Of the Art (mentioning)
confidence: 99%
“…Establishing the necessary framework for holistic and continuous monitoring of large-scale HPC systems and their infrastructure is extremely challenging in many ways [4,9].…”
Section: Monitoring Challenges (mentioning)
confidence: 99%
“…Current approaches either lack the necessary efficiency to be utilized in production systems or support only post-mortem analysis that does not present online data about application events during the execution. Also, developers face challenges when analyzing applications that scale to larger parallel systems [9].…”
Section: Introduction (mentioning)
confidence: 99%
“…We integrate our approach with the Lightweight Distributed Metric Service (LDMS) system [2], a monitoring system used on large-scale computational platforms [9]. LDMS provides the infrastructure to gather streams of performance data efficiently while keeping the overhead low.…”
Section: Introduction (mentioning)
confidence: 99%