Abstract. The predicted reduced resiliency of next-generation high performance computers means that it will become necessary to take into account the effects of randomly occurring faults on numerical methods. Further, in the event of a hard fault occurring, a decision has to be made as to what remedial action should be taken in order to resume the execution of the algorithm. The action that is chosen can have a dramatic effect on the performance and characteristics of the scheme. Ideally, the resulting algorithm should be subjected to the same kind of mathematical analysis that was applied to the original, deterministic variant.

The purpose of this work is to provide an analysis of the behaviour of the multigrid algorithm in the presence of faults. Multigrid is arguably the method of choice for the solution of large-scale linear algebra problems arising from discretization of partial differential equations and it is of considerable importance to anticipate its behaviour on an exascale machine. The analysis of resilience of algorithms is in its infancy and the current work is perhaps the first to provide a mathematical model for faults and analyse the behaviour of a state-of-the-art algorithm under the model. It is shown that the Two Grid Method fails to be resilient to faults. Attention is then turned to identifying the minimal necessary remedial action required to restore the rate of convergence to that enjoyed by the ideal fault-free method.
2010 Mathematics Subject Classification. 65F10, 65N22, 65N55, 68M15.

Introduction

President Obama's executive order in the summer of 2015 establishing the National Strategic Computing Initiative committed the US to the development of a capable exascale computing system. Given that the performance of the current number one machine, Tianhe-2, is roughly one thirtieth of that of an exascale system, it is easy to underestimate the challenge posed by this task. One way to envisage the scale of the undertaking is that the combined processing power of the entire TOP 500 list is less than half of one exaflop ($10^{18}$ floating point operations per second).

It is widely accepted that an exascale machine should respect a 20 MW power envelope. Tianhe-2 already consumes 18 MW of power, and if it were possible to simply upscale to exascale using the current technology, would require around