Proceedings of the 23rd International Symposium on High-Performance Parallel and Distributed Computing 2014
DOI: 10.1145/2600212.2600219

Transparent Checkpoint-Restart over InfiniBand

Abstract: InfiniBand is widely used for low-latency, high-throughput cluster computing. Saving the state of the InfiniBand network as part of distributed checkpointing has been a long-standing challenge for researchers. Because of a lack of a solution, typical MPI implementations have included custom checkpoint-restart services that "tear down" the network, checkpoint each node as if the node were a standalone computer, and then re-connect the network again. We present the first example of transparent, system-initiated …

Cited by 30 publications (28 citation statements)
References 22 publications

“…In testing on the LU.E benchmark, we saw runtime overhead rise to 9% for 4K CPU cores compared to the 1.7% runtime overhead at 2K cores reported by [13]. This was due to the non-scalable tracing of send/receive requests required by the InfiniBand checkpointing code to shadow hardware state, since InfiniBand devices don't provide a way to "peek" at the current state.…”
Section: Reducing RC-mode Runtime Overhead (mentioning)
confidence: 83%
“…At 32,752 CPU cores, the tests use one-third of the compute nodes of Stampede. This is sixteen times as many cores as the largest previous transparent checkpoint of which we are aware [13]. The Stampede supercomputer is rated at 5.2 PFlops (RMAX:sustained) or 8.5 PFlops (RPEAK:peak).…”
Section: Setup (mentioning)
confidence: 99%
“…For example, memory pinning register segments in the device (to allow address translation and to retrieve authentication tokens) is not checkpointable. In order to circumvent this issue, DMTCP provided a plugin completely wrapping the libverbs (low-level Infiniband programming interface) in order to track and preserve a shadow state of all the operations taking place on the card [11]. This approach enabled transparent checkpointing of Infiniband networks, but not without some drawbacks.…”
Section: High-speed Network Support (mentioning)
confidence: 99%
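
The shadow-state idea in the preceding citation statement can be illustrated with a toy model. The sketch below is not the DMTCP plugin and does not call any real verbs API: `ShadowQueuePair`, `replay`, and every method name are hypothetical, and re-posting receives before sends on restart is an assumption made only for this illustration. It shows the general technique of recording each posted request in user space because the device state cannot be read back, retiring it on completion, and re-posting whatever is still outstanding after a restart.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class ShadowQueuePair:
    """Hypothetical user-space mirror of one queue pair's outstanding requests."""
    qp_num: int
    pending_sends: Dict[int, bytes] = field(default_factory=dict)
    pending_recvs: Dict[int, int] = field(default_factory=dict)

    def post_send(self, wr_id: int, payload: bytes) -> None:
        # Log the request; a real wrapper would then forward it to the device.
        self.pending_sends[wr_id] = payload

    def post_recv(self, wr_id: int, length: int) -> None:
        # Log the receive buffer length under its work-request id.
        self.pending_recvs[wr_id] = length

    def complete(self, wr_id: int) -> None:
        # A completion retires the request from the shadow log.
        self.pending_sends.pop(wr_id, None)
        self.pending_recvs.pop(wr_id, None)

    def checkpoint(self) -> dict:
        # Copy plain data only, so the snapshot can be saved with the process image.
        return {
            "qp_num": self.qp_num,
            "sends": dict(self.pending_sends),
            "recvs": dict(self.pending_recvs),
        }


def replay(snapshot: dict) -> List[Tuple[str, int, object]]:
    """List the requests to re-post on a fresh connection after restart.

    Assumption for this sketch: receives are re-posted before sends.
    """
    actions: List[Tuple[str, int, object]] = []
    for wr_id, length in snapshot["recvs"].items():
        actions.append(("recv", wr_id, length))
    for wr_id, payload in snapshot["sends"].items():
        actions.append(("send", wr_id, payload))
    return actions


if __name__ == "__main__":
    qp = ShadowQueuePair(qp_num=7)
    qp.post_recv(1, 4096)
    qp.post_send(2, b"a")
    qp.post_send(3, b"b")
    qp.complete(2)  # request 2 finished before the checkpoint
    print(replay(qp.checkpoint()))
```

Every post and completion touches the shadow log in this model, so the bookkeeping cost grows with the number of requests traced.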
“…As of this writing, Stampede is the #10 supercomputer on the Top500 list [78]. In all cases, each computer node was running This is sixteen times as many cores as the largest previous transparent checkpoint of which we are aware [15]. The Stampede supercomputer is rated at 5.2 PFlops (RMAX:sustained) or 8.5 PFlops (RPEAK:peak) [78].…”
Section: Setup (mentioning)
confidence: 99%