CPRtree: A Tree-Based Checkpointing Architecture for Heterogeneous FPGA Computing

Vu, Hoang-Gia; Kajkamhaeng, Supasit; Takamaeda-Yamazaki, Shinya; Nakashima, Yasuhiko

doi:10.1109/candar.2016.0024

Cited by 7 publications

(2 citation statements)

References 17 publications

“…Several checkpointing strategies have been proposed in the literature. A tree based checkpointing architecture called CPRTree [19] was proposed. Their approach was based on Hardware Description Language (HDL) and consisted in saving and restoring the state of all elements that define the context (registers, RAM and wires).…”

Section: B. FPGA Checkpointing Studies (mentioning)

confidence: 99%

FPGA Checkpointing for Scientific Computing

Bacardit

Bautista-Gomez

Ünsal

2021

2021 IEEE 27th International Symposium on on-Line Testing and Robust System Design (IOLTS)

The use of FPGAs in computational workloads is becoming increasingly popular due to the flexibility of these devices in comparison to ASICs, and their low power consumption compared to GPUs and CPUs. However, scientific applications run for long periods of time and the hardware is always subject to failures due to either soft or hard errors. Thus, it is important to protect these long running jobs with fault tolerance mechanisms. Checkpoint-Restart is a popular technique in high-performance computing that allows large scale applications to cope with frequent failures. In this work we approach the fault tolerance of CPU-FPGA heterogeneous applications from a high level by using OmpSs@FPGA environment and a multi-level checkpointing library. We analyse the performance of several different applications and we understand what kind of overheads we can expect from checkpointing computational workloads running on FPGAs. Our results demonstrate overheads as low as 0.16% and 0.66% when checkpointing very frequently, indicating that this technique is efficient and does not add a significant amount of overhead to the system. In addition, we showcase a proof of concept for checkpointing partial data of the FPGA task itself. This can prove useful for workloads in which most data is offloaded to the FPGA memory at once and do not constantly move all the data between the accelerator and the CPU.

“…The first four contributions were presented at the Third International Symposium on Computing and Networking (CANDAR 2016) [6]. After that, a Python-based software to generate checkpointing source code was developed.…”

Section: Introduction (mentioning)

confidence: 99%

A Tree-Based Checkpointing Architecture for the Dependability of FPGA Computing

Takamaeda-Yamazaki

Nakada

et al. 2018

IEICE Trans. Inf. & Syst.

Self Cite

SUMMARY: Modern FPGAs have been integrated in computing systems as accelerators for long running applications. This integration puts more pressure on the fault tolerance of computing systems, and the requirement for dependability becomes essential. As in the case of CPU-based system, checkpoint/restart techniques are also expected to improve the dependability of FPGA-based computing. Three issues arise in this situation: how to checkpoint and restart FPGAs, how well this checkpoint/restart model works with the checkpoint/restart model of the whole computing system, and how to build the model by a software tool. In this paper, we first present a new checkpoint/restart architecture along with a checkpointing mechanism on FPGAs. We then propose a method to capture consistent snapshots of FPGA and the rest of the computing system. Third, we provide "fine-grained" management for checkpointing to reduce performance degradation. For the host CPU, we also provide a stack which includes API functions to manage checkpoint/restart procedures on FPGAs. Fourth, we present a Python-based tool to insert checkpointing infrastructure. Experimental results show that the checkpointing architecture causes less than 10% maximum clock frequency degradation, low checkpointing latencies, small memory footprints, and small increases in power consumption, while the LUT overhead varies from 17.98% (Dijkstra) to 160.67% (Matrix Multiplication).
