“…A number of advanced resilience technologies have been developed and/or are currently in development, including checkpoint/restart-specific file and storage systems, incremental/differential checkpointing, message logging for uncoordinated checkpointing, fault-tolerant message passing interface (FT-MPI), containment domains, algorithm-based fault tolerance (ABFT), rejuvenation, reliability-aware scheduling, proactive migration, and redundancy [7], [8], [9], [10], [11], [12], [13], [14], [15], [16], [17], [18], [19]. However, there are currently no tools, methods, or metrics to compare them fairly, especially at extreme scale, or to identify the cost/benefit trade-off.…”