2010 IEEE International Conference on Cluster Computing 2010
DOI: 10.1109/cluster.2010.20

RDMA-Based Job Migration Framework for MPI over InfiniBand

citations
Cited by 21 publications
(11 citation statements)
references
References 22 publications
“…As opposed to evacuation in the migratable-objects model, task migration in CoCheck was synchronous with the execution of the system, i.e., before an MPI rank could be migrated, CoCheck had to make sure all communicating ranks would hold back messages until the rank was at its new location. Similar tools have recently used process-level live migration in MPI applications [36], [37], [38]. Those tools combine health monitoring of nodes with live migration to provide proactive fault tolerance.…”
Section: Related Work (mentioning)
confidence: 99%
“…In [8], we have made an initial attempt to optimize (1) and (2) by leveraging RDMA to transfer checkpoint data. However it hasn't totally solved the problem since the heavy IO overhead at (3) still dominates.…”
Section: Introduction (mentioning)
confidence: 99%
“…Job/process migration [6][7][8][9], a pro-active fault-tolerance mechanism, has been proposed as a complement to C/R. During a migration, the processes running on a source node are checkpointed and the checkpoint data is transferred to a healthy spare node where the processes are restarted.…”
Section: Introduction (mentioning)
confidence: 99%
“…We have proposed a file-based migration design [102] that creates checkpoint files of the processes on the migration source node and move the files via different transports to the target node to restart the processes. The team has made significant enhancements in data transmission at [101] to directly pump process image through a RDMA data pipeline from migration source node to the target node without any file system IO overhead.…”
Section: End-to-end Reliable Data Transmission In Mvapich (mentioning)
confidence: 99%