Future main memory will likely include Non-Volatile Memory. Non-Volatile Main Memory (NVMM) provides an opportunity to rethink checkpointing strategies for providing failure safety to applications. While there are many checkpointing and logging schemes in the literature, their use must be revisited as they incur high execution time overheads as well as a large number of additional writes to NVMM, which may significantly impact write endurance. In this article, we propose a novel recompute-based failure safety approach and demonstrate its applicability to loop-based code. Rather than keeping a fully consistent logging state, we only log enough state to enable recomputation. Upon a failure, our approach recovers to a consistent state by determining which parts of the computation were not completed and recomputing them. Effectively, our approach removes the need to keep checkpoints or logs, thus reducing execution time overheads and improving NVMM write endurance at the expense of more complex recovery. We compare our new approach against logging and checkpointing on five scientific workloads, including tiled matrix multiplication, on a computer system model that was built on gem5 and supports Intel PMEM instruction extensions. For tiled matrix multiplication, our recompute approach incurs an execution time overhead of only 5%, in contrast to 8% overhead with logging and 207% overhead with checkpointing. Furthermore, recompute only adds 7% additional NVMM writes, compared to 111% with logging and 330% with checkpointing. We also conduct experiments on real hardware, allowing us to run our workloads to completion while varying the number of threads used for computation. These experiments substantiate our simulation-based observations and provide a sensitivity study and performance comparison between the Recompute Scheme and Naive Checkpointing.
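The recovery idea summarized above (record only enough state to know what finished, then recompute the rest) can be illustrated with a minimal sketch. This is not the article's implementation: the function name `run_tiles`, the `done` marker dictionary, and the `crash_after` parameter are hypothetical, and ordinary Python objects stand in for NVMM-resident data. It only shows why tile-level recomputation is safe when each tile is computed from scratch, so a torn partial result is simply overwritten.

```python
def run_tiles(a, b, c, n, tile, done, crash_after=None):
    """Compute c = a * b one output tile at a time.

    `done` is a hypothetical stand-in for the small amount of persistent
    state that records which output tiles were completed. `crash_after`
    simulates a failure after that many tiles finish in this call.
    Returns True if all tiles completed, False if the run "crashed".
    """
    finished = 0
    for ti in range(0, n, tile):
        for tj in range(0, n, tile):
            if done.get((ti, tj)):
                continue  # tile completed before the failure; nothing to redo
            if crash_after is not None and finished == crash_after:
                c[ti][tj] = -1  # simulate a torn, partially written tile
                return False
            # Each tile is computed from the inputs alone, so recomputing it
            # is idempotent and overwrites any partial result.
            for i in range(ti, min(ti + tile, n)):
                for j in range(tj, min(tj + tile, n)):
                    c[i][j] = sum(a[i][k] * b[k][j] for k in range(n))
            # Assumption: on real NVMM the tile data would have to be made
            # durable (flush plus fence) before this marker is written.
            done[(ti, tj)] = True
            finished += 1
    return True
```

Calling `run_tiles` once with `crash_after` set and then again with the same `c` and `done` models a failure followed by recovery: the second call skips tiles marked in `done` and recomputes only the incomplete ones.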
INTRODUCTION

Non-volatile memory (NVM) technologies have been advancing rapidly, and some are strong contenders for use as future main memory, either augmenting or replacing DRAM. One such example is 3D Xpoint memory, which will be brought to market in 2017 by Intel and Micron [28]. These new non-volatile main memory (NVMM) technologies are byte-addressable and have access latencies that are not much slower than DRAM [3,6,8,28,34,37,38,50]. NVMMs are expected to have limited write endurance, making it imperative to keep the number of writes low [7]. Despite the limited write endurance and relatively high write latency compared to DRAM, NVMM's density and cost advantages over DRAM and its near-zero idle power consumption make it a compelling candidate to replace or augment DRAM in high-performance computers [5].

At the same time, as high-performance computing (HPC) relies on an increasing number of nodes and components, it becomes increasingly likely that a long-running computation will be interrupted by failures before completing. Frequent checkpointing has become essential because it allows applications to resume from a recent snapshot rather than re-execute from ...