Efficient software checking for fault tolerance

Yu, Jing; Garzarán, María Jesús; Snir, Marc

doi:10.1109/ipdps.2008.4536435

Cited by 4 publications

(2 citation statements)

References 47 publications

(92 reference statements)


“…The rollback is the ability to return to a previously valid state of the processor in the case, for instance, of an execution error or a power failure. We assume that an error detection mechanism is available in the processor architecture to identify errors during execution as proposed, for instance, in Yu et al [2008] or Wali et al [2016]. The principle of the rollback is shown in Figure 7.…”

Section: Rollback (mentioning)

confidence: 99%

Non-Volatile Processor Based on MRAM for Ultra-Low-Power IoT Devices

Senni

Torres

Sassatelli

et al. 2016

J. Emerg. Technol. Comput. Syst.


Over the past few years, a new era of smart connected devices has emerged in the market to enable the future world of the Internet of Things (IoT). A key requirement for IoT applications is the power consumption to allow very high autonomy in the case of battery-powered systems. Depending on the application, such devices will be most of the time in a low-power mode (sleep mode) and will wake up only when there is a task to accomplish (active mode). Emerging non-volatile memory technologies are seen as a very attractive solution to design ultra-low-power systems. Among these technologies, magnetic random access memory is a promising candidate, as it combines non-volatility, high density, reasonable latency, and low leakage. Integration of non-volatility as a new feature of memories has the great potential to allow full data retention after a complete shutdown with a fast wake-up time. This article explores the benefits of having a non-volatile processor to enable ultra-low-power IoT devices.



“…For example, the BlueGene/L experiences one soft error in its L1 cache every 4-6 hours [1]. All these factors make that the massive parallel CFD applications are more vulnerable to the failure attack [2], [3]. Checkpoint/Restart technology is a widely used fault tolerant (FT) method, which periodically backups the intermediate result to the stable storage, and rollbacks to the nearest checkpoint when a failure occurs.…”

Section: Introduction (mentioning)

confidence: 99%

The Analysis of Checkpoint Strategies for Large-Scale CFD Simulation in HPC System

Ren

Tang

et al. 2014

2014 Fourth International Conference on Communication Systems and Network Technologies


With the development of electronic technology, the processor count in a supercomputer has reached the million scale, making faults a fundamental issue for massively parallel CFD simulation. Checkpoint/Rollback technology is a widely used fault-tolerance method and has an obvious effect on massively parallel applications. In this paper, we explore checkpoint methods for CFD simulation based on the features of CFD simulation, and analyze two checkpoint strategies: fine-granularity checkpointing and coarse checkpointing. We analyze the checkpoint intervals and the volume of the backup data, and their impact on the FT overhead through a model. Experimental results on the Tianhe-2 supercomputer demonstrate that coarse checkpointing can achieve a much better FT effect for the CFD simulation.
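The checkpoint/rollback idea mentioned in the citation statement and abstract above can be illustrated with a minimal sketch. It is not the method of any of the cited papers: the function name `run_with_checkpoints`, the toy workload, the in-memory "stable storage", and the single injected failure at `fail_at` are all assumptions made for illustration. It shows only how the checkpoint interval determines how much work must be recomputed after a rollback.

```python
import copy


def run_with_checkpoints(steps, interval, fail_at=None):
    """Run a toy iterative computation with periodic checkpoints.

    Hypothetical sketch: `fail_at` injects one detected failure at that
    step; the in-memory `checkpoint` stands in for stable storage.
    Returns (final_state, number_of_steps_that_had_to_be_recomputed).
    """
    state = {"step": 0, "value": 0}
    checkpoint = copy.deepcopy(state)  # initial checkpoint at step 0
    recomputed = 0
    failed = False

    while state["step"] < steps:
        # Simulated failure detection: happens at most once.
        if fail_at is not None and not failed and state["step"] == fail_at:
            failed = True
            # Work done since the last checkpoint is lost.
            recomputed += state["step"] - checkpoint["step"]
            state = copy.deepcopy(checkpoint)  # roll back
            continue

        state["value"] += state["step"]  # toy unit of work
        state["step"] += 1

        # Periodically save the intermediate result.
        if state["step"] % interval == 0:
            checkpoint = copy.deepcopy(state)

    return state, recomputed


if __name__ == "__main__":
    final, lost = run_with_checkpoints(steps=10, interval=4, fail_at=7)
    print(final["value"], lost)
```

In this sketch a shorter interval reduces the recomputed work after a failure but takes more checkpoints, which is the trade-off between checkpoint interval and overhead that the abstract refers to.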
