Abstract. Questions about whether numerical simulations are reproducible have been raised in several sensitive applications. Failures of numerical reproducibility mainly come from the finite precision of computer arithmetic: the results of floating-point computations depend on the arithmetic precision and on the order of the arithmetic operations. Massively parallel HPC platforms that combine, for instance, many-core CPUs and GPUs clearly modify these two parameters, even from run to run on a given computing platform. How can such computed results be trusted? This paper presents how three classic approaches in computer arithmetic may provide some first steps towards more numerical reproducibility.
Numerical reproducibility: context and motivations

As computing power increases towards exascale, more complex and larger-scale numerical simulations are performed in various domains. Questions about whether such simulated results are reproducible have been reported more or less recently, e.g. in energy science [1], dynamic weather forecasting [2], atomic or molecular dynamics [3,4], and fluid dynamics [5]. This paper focuses on numerical non-reproducibility due to the finite precision of computer arithmetic; see [6] for other issues regarding "reproducible research" in computational mathematics.

The following example illustrates a typical failure of numerical reproducibility. In the energy field, power system state simulation aims to compute in "real time" a reliable estimate of the bus voltages for a given power grid topology and a set of on-line measurements. Numerically speaking, a large and sparse linear system is solved at every iteration of a Newton-Raphson process. The core computation is a sparse matrix-vector product that is automatically parallelised by the computing environment. The authors of [1] exhibit a significant variability (up to 25% relative difference) between two runs on a massively multithreaded system. The culprit? Here, as in the previously cited references: non-deterministic sums.

Floating-point summation is not associative. Parallelism introduces non-deterministic events from run to run, even when a single binary is run on a given computing platform. The order of communications, the number of computing units (threads, processors) and the associated data placement may vary, and hence so do the parallel partial sums; the first sketch at the end of this section makes this effect concrete. Even sequential executions that comply with the IEEE-754 floating-point arithmetic standard [7] remain numerically very sensitive to many features: low-level arithmetic unit properties (variable precision registers, fused operators), compiler optimizations, language flaws or library versions reduce numerical repeatability and numerical portability [8]; the second sketch below illustrates the effect of fused operators.

* The authors thank Cl.-P. Jeannerod (INRIA) for his significant contribution, I. Said (LIP6) for his help with the numerical experiments related to the acoustic wave equation, and the GT Arith, GDR Informatique Mathématique, for its support.
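To make the non-associativity of floating-point summation concrete, here is a minimal sketch, not taken from the cited experiments: the data are hypothetical values chosen only to expose the rounding, and each operation is assumed to be rounded to IEEE-754 binary64 (the default double arithmetic on current x86-64 platforms). The same four numbers are summed left to right, as a serial loop would do, and then as two partial sums combined at the end, as a two-thread reduction would do.

```c
#include <stdio.h>

int main(void) {
    /* Hypothetical data: the small terms are absorbed or cancelled
     * depending on how the additions are grouped. */
    double x[4] = {1.0, 1e16, -1e16, 1.0};

    /* Sequential left-to-right order: 1.0 + 1e16 rounds back to 1e16
     * (the spacing of doubles near 1e16 is 2), the large terms then
     * cancel, and only the last 1.0 survives. */
    double serial = ((x[0] + x[1]) + x[2]) + x[3];

    /* Two partial sums combined at the end, as a two-thread reduction
     * would do: both small terms are absorbed before the cancellation,
     * so they are lost. */
    double chunked = (x[0] + x[1]) + (x[2] + x[3]);

    printf("exact sum              : 2.0\n");
    printf("left-to-right grouping : %.1f\n", serial);   /* prints 1.0 */
    printf("two partial sums       : %.1f\n", chunked);  /* prints 0.0 */
    return 0;
}
```

Both computed results differ from the exact sum 2.0, and a reduction over a different number of chunks could produce yet another value: this is the run-to-run variability described above, in miniature.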
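Fused operators and compiler options affect even purely sequential results. The following sketch, again with hypothetical values, compares a multiplication followed by a subtraction with a fused multiply-add: the fused operation performs a single rounding and exposes the rounding error of the product, so a compiler that contracts the plain expression into an FMA (or hardware that provides one) silently changes the result of the same source code.

```c
#include <math.h>
#include <stdio.h>

int main(void) {
    double a = 1.0 + ldexp(1.0, -30);   /* a = 1 + 2^-30 */

    /* The exact square 1 + 2^-29 + 2^-60 needs 61 significand bits,
     * so the binary64 product is rounded. */
    double p = a * a;

    /* fma() computes a*a - p exactly and rounds once: it recovers the
     * rounding error of the product, 2^-60 here. */
    double fused = fma(a, a, -p);

    /* With two separately rounded operations the same expression is
     * exactly zero -- unless the compiler contracts it into an FMA
     * (e.g. -ffp-contract=fast), in which case it matches the line
     * above.  Same source, different results. */
    double separate = a * a - p;

    printf("fused    : %.17g\n", fused);     /* 2^-60, about 8.7e-19 */
    printf("separate : %.17g\n", separate);  /* typically 0 */
    return 0;
}
```

Compile with, for instance, `cc -std=c99 fma_demo.c -lm` (the file name is arbitrary); toggling the contraction option of the compiler is enough to change the second printed value.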