2018 48th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN) 2018
DOI: 10.1109/dsn.2018.00023

Understanding and Analyzing Interconnect Errors and Network Congestion on a Large Scale HPC System

Cited by 10 publications (1 citation statement)
References 18 publications
“…In [39], the authors proposed a novel scheme that used the spatial locality of failures to provide better application and system performance. In [40], the authors developed a thorough understanding of interconnect errors and job characteristics on an enterprise class supercomputer. Network traffic and access patterns were identified by Google's Network Telemetry [41].…”
Section: Related Work
Citation type: mentioning
confidence: 99%