Kai Keller scite author profile

Kai Keller

4Publications

29Citation Statements Received

67Citation Statements Given

How they've been cited

How they cite others

Affiliations

Barcelona Supercomputing Center, University of Wisconsin System, University of Hagen

Publications

Order By: Most citations

A method for reducing the target fault list of crosstalk faults in synchronous sequential circuits

Takahashi

Keller

et al. 2005

IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst.

View full text Add to dashboard Cite

Abstract-In this paper, we describe a method of identifying a set of target crosstalk faults which may need to be tested in synchronous sequential circuits. Our method classifies the pairs of aggressor and victim lines, using topological and timing information, to deduce a set of target crosstalk faults. In this process, our method also identifies the false crosstalk faults that need not (and/or cannot) be tested in synchronous sequential circuits. Experimental results for ISCAS'89 and ITC'99 benchmark circuits show that the proposed method is CPU time efficient in obtaining the reduced lists of the target crosstalk faults. Also, the lists of the target crosstalk faults obtained by our method are substantially smaller than the sets of all possible combinations of faults.Index Terms-Crosstalk faults, lists of the target crosstalk faults, synchronous sequential circuits.

show abstract

Checkpoint Restart Support for Heterogeneous HPC Applications

Parasyris

Keller

Bautista-Gomez

et al. 2020

View full text Add to dashboard Cite

As we approach the era of exa-scale computing, fault tolerance is of growing importance. The increasing number of cores as well as the increased complexity of modern heterogenous systems result in substantial decrease of the expected mean time between failures. Among the different fault tolerance techniques, checkpoint/restart is vastly adopted in supercomputing systems. Although many supercomputers in the TOP 500 list use GPUs, only a few checkpoint restart mechanism support GPUs. In this paper, we extend an application level checkpoint library, called fault tolerance interface (FTI), to support multinode/multi-GPU checkpoints. In contrast to previous work, our library includes a memory manager, which upon a checkpoint invocation tracks the actual location of the data to be stored and handles the data accordingly. We analyze the overhead of the checkpoint/restart procedure and we present a series of optimization steps to massively decrease the checkpoint and recovery time of our implementation. To further reduce the checkpoint time we present a differential checkpoint approach which writes only the updated data to the checkpoint file. Our approach is evaluated and, in the best case scenario, the execution time of a normal checkpoint is reduced by 15x in contrast with a non-optimized version, in the case of differential checkpoint the overhead can drop to 2.6% when checkpointing every 30s.

show abstract

On reducing the target fault list of crosstalk-induced delay faults in synchronous sequential circuits

Keller

Takahashi²,

Saluja³

et al.

View full text Add to dashboard Cite

Application-Level Differential Checkpointing for HPC Applications with Dynamic Datasets

Keller

Bautista-Gomez

2019

View full text Add to dashboard Cite

High-performance computing (HPC) requires resilience techniques such as checkpointing in order to tolerate failures in supercomputers. As the number of nodes and memory in supercomputers keeps on increasing, the size of checkpoint data also increases dramatically, sometimes causing an I/O bottleneck. Differential checkpointing (dCP) aims to minimize the checkpointing overhead by only writing data differences. This is typically implemented at the memory page level, sometimes complemented with hashing algorithms. However, such a technique is unable to cope with dynamic-size datasets. In this work, we present a novel dCP implementation with a new file format that allows fragmentation of protected datasets in order to support dynamic sizes. We identify dirty data blocks using hash algorithms. In order to evaluate the dCP performance, we ported the HPC applications xPic, LULESH 2.0 and Heat2D and analyze them regarding their potential of reducing I/O with dCP and how this data reduction influences the checkpoint performance. In our experiments, we achieve reductions of up to 62% of the checkpoint time.

show abstract

scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.

Contact Info

customersupport@researchsolutions.com

10624 S. Eastern Ave., Ste. A-614

Henderson, NV 89052, USA

This site is protected by reCAPTCHA and the Google Privacy Policy and Terms of Service apply.

Blog Terms and Conditions API Terms Privacy Policy Contact Cookie Preferences Do Not Sell or Share My Personal Information

Made with 💙 for researchers

Part of the Research Solutions Family.

Kai Keller

A method for reducing the target fault list of crosstalk faults in synchronous sequential circuits

Checkpoint Restart Support for Heterogeneous HPC Applications

On reducing the target fault list of crosstalk-induced delay faults in synchronous sequential circuits

Application-Level Differential Checkpointing for HPC Applications with Dynamic Datasets

Contact Info

Product

Resources

About