Towards Efficient Cache Allocation for High-Frequency Checkpointing

Maurya, Avinash; Nicolae, Bogdan; Rafique, M. Mustafa; El-Sayed, Amr; Tonellot, Thierry; Cappello, F

doi:10.1109/hipc56025.2022.00043

2022 IEEE 29th International Conference on High Performance Computing, Data, and Analytics (HiPC)

Cited by 3 publications

References 38 publications

Towards Efficient I/O Pipelines Using Accumulated Compression

Maurya, Nicolae, Rafique et al. 2023

2023 IEEE 30th International Conference on High Performance Computing, Data, and Analytics (HiPC)

Self Cite

High-Performance Computing (HPC) workloads generate large volumes of data at high frequency during their execution, which need to be captured concurrently at scale. These workloads exploit accelerators such as GPUs for faster performance. However, the limited onboard high-bandwidth memory (HBM) on the GPU and slow device-to-host PCIe interconnects lead to I/O overheads during application execution, thereby increasing overall runtime. To overcome these limitations, data management runtimes have used techniques such as compression and asynchronous transfers. However, compressing small blocks of data imposes a significant runtime penalty on the application. In this paper, we design and develop strategies to optimize the tradeoff between compressing checkpoints instantly and enqueuing transfers immediately versus accumulating snapshots and delaying compression to achieve faster compression throughput. Our evaluations on synthetic and real-life workloads for different systems and workload configurations demonstrate 1.3× to 8.3× speedup compared to the existing checkpoint approaches.

Evaluating Asynchronous Parallel I/O on HPC Systems

Ravi, Byna, Koziol et al. 2023

2023 IEEE International Parallel and Distributed Processing Symposium (IPDPS)

GPU-Enabled Asynchronous Multi-level Checkpoint Caching and Prefetching

Maurya, Rafique, Tonellot et al. 2023

Proceedings of the 32nd International Symposium on High-Performance Parallel and Distributed Computing

Checkpointing is an I/O-intensive operation increasingly used by High-Performance Computing (HPC) applications to revisit previous intermediate datasets at scale. Unlike the case of resilience, where only the last checkpoint is needed for application restart and is rarely accessed to recover from failures, in this scenario it is important to optimize frequent reads and writes of an entire history of checkpoints. State-of-the-art checkpointing approaches often rely on asynchronous multi-level techniques to hide I/O overheads by writing to fast local tiers (e.g. an SSD) and asynchronously flushing to slower, potentially remote tiers (e.g. a parallel file system) in the background, while the application keeps running. However, such approaches have two limitations. First, although HPC infrastructures routinely rely on accelerators (e.g. GPUs), and therefore a majority of the checkpoints involve GPU memory, efficient asynchronous data movement between GPU memory and host memory is lagging behind. Second, revisiting previous data often involves predictable access patterns, which are not exploited to accelerate read operations. In this paper, we address these limitations by proposing a scalable and asynchronous multi-level checkpointing approach optimized for both reading and writing of an arbitrarily long history of checkpoints. Our approach exploits GPU memory as a first-class citizen in the multi-level storage hierarchy to enable informed caching and prefetching of checkpoints by leveraging foreknowledge about the access order passed by the application as hints. Our evaluation using a variety of scenarios under I/O concurrency shows up to 74× faster checkpoint and restore throughput as compared to the state-of-the-art runtime and optimized unified virtual memory (UVM) based prefetching strategies and at ...
