Memory-bound algorithms show complex performance and energy consumption behavior on multicore processors. We choose the lattice Boltzmann method (LBM) on an Intel Sandy Bridge cluster as a prototype scenario to investigate if and how single-chip performance and power characteristics can be generalized to the highly parallel case. First, we perform an analysis of a sparse-lattice LBM implementation for complex geometries. Using a single-core performance model, we predict the intra-chip saturation characteristics and the optimal operating point in terms of energy-to-solution as a function of implementation details, clock frequency, vectorization, and number of active cores per chip. We show that high single-core performance and a correct choice of the number of active cores per chip are the essential optimizations for achieving the lowest energy-to-solution at minimal performance degradation. Then we extrapolate to the Message Passing Interface (MPI)-parallel level and quantify the energy-saving potential of various optimizations and execution modes; there, these guidelines turn out to be even more important, especially when communication overhead is non-negligible. In our setup, we achieve energy savings of 35% in this case compared with a naive approach. We also demonstrate that a simple, non-reflective reduction of the clock speed leaves most of the energy-saving potential unused.

[1-10]. Here, we conduct a thorough analysis of performance and energy-to-solution at the chip level and the highly parallel level for an MPI-parallel implementation of the LBM. We start from observations of the intra-chip saturation characteristics of two different implementations, which differ in the order in which the flow data in the lattice sites are updated ('propagation methods' [10]). Then we apply the execution-cache-memory (ECM) performance model and a simple multicore power model to describe the optimal operating point in terms of performance and energy-to-solution as a function of the clock frequency and the single instruction multiple data (SIMD) vectorization (a schematic numerical sketch of such a model follows below). To find out whether the knowledge thus gained at the chip level can be generalized to the highly parallel case, we conduct scaling experiments on a modern cluster system up to a point where MPI communication overhead becomes significant.

This paper is organized as follows. The remainder of Section 1 covers related work, the basics of the lattice Boltzmann implementations, the hardware used for testing, and a list of contributions. Section 2 then introduces, applies, and validates the ECM model on the Intel Sandy Bridge architecture. In Section 3, we use a recently introduced multicore power model to identify the optimal operating points on the chip. Section 4 presents performance data for highly parallel runs and analyzes the impact of the different parameters (clock speed, number of cores per chip, SIMD vectorization, and system baseline power). Finally, Section 5 gives a summary and an outlook on future research.
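As a rough illustration of the kind of chip-level reasoning outlined above, the following Python sketch combines a saturating performance estimate (linear core scaling up to a memory-bandwidth ceiling, in the spirit of roofline/ECM-type arguments) with a baseline-plus-dynamic power model of the form W0 + n(W1 f + W2 f^2), and scans active cores and clock frequency for the energy-optimal operating point. This is not the paper's code or model calibration: the function name energy_to_solution and all numerical coefficients are illustrative assumptions only.

```python
# Minimal sketch (illustrative only, not the code or coefficients used in this
# paper): energy-to-solution from a saturating performance model combined with
# a baseline-plus-dynamic chip power model.

def energy_to_solution(n_cores, f_ghz,
                       w0=25.0,      # assumed baseline power [W]
                       w1=2.0,       # assumed per-core dynamic power, linear in f [W/GHz]
                       w2=1.0,       # assumed per-core dynamic power, quadratic in f [W/GHz^2]
                       p_core=120.0, # assumed single-core LBM performance at 1 GHz [MLUP/s]
                       p_sat=650.0): # assumed bandwidth-limited saturation performance [MLUP/s]
    """Return estimated energy per million lattice-site updates [J/MLUP]."""
    perf = min(n_cores * p_core * f_ghz, p_sat)             # MLUP/s, saturates at the memory limit
    power = w0 + n_cores * (w1 * f_ghz + w2 * f_ghz ** 2)   # W
    return power / perf                                      # (J/s) / (MLUP/s) = J/MLUP

# Scan active cores and clock frequencies for the energy-optimal operating point.
best = min(((energy_to_solution(n, f), n, f)
            for n in range(1, 9)                  # up to 8 cores per Sandy Bridge chip
            for f in (1.2, 1.6, 2.0, 2.4, 2.7)),  # illustrative frequency steps [GHz]
           key=lambda t: t[0])
print("lowest energy-to-solution: %.3f J/MLUP at %d cores, %.1f GHz" % best)
```

With this toy parameterization, the minimum is found at the smallest core count and clock frequency that just saturate the memory bandwidth, which is the qualitative behavior analyzed rigorously, with measured model parameters, in Sections 2 and 3.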
Related work
The roofline model of Williams et al. [11] pr...