Deploying large numbers of small, low-power cores has been gaining traction recently as a system design strategy in high performance computing (HPC). The ARM platform that dominates the embedded and mobile computing segments is now being considered as an alternative to the high-end x86 processors that largely dominate HPC, because peak performance per watt may be substantially improved using off-the-shelf commodity processors. In this work we methodically characterize the performance and energy of HPC computations drawn from a number of problem domains on current ARM and x86 processors. Unsurprisingly, we find that the performance, energy and energy-delay product of applications running on these platforms vary significantly across problem types and inputs. Using static program analysis, we further show that this variation can be explained largely in terms of the capabilities of two processor subsystems: single instruction multiple data (SIMD)/floating point and the cache/memory hierarchy; and that static analysis of this kind is sufficient to predict which platform is best for a particular application/input pair. In the context of these findings, we evaluate how some of the key architectural changes being made for upcoming 64-bit ARM platforms may impact HPC application performance.
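The abstract above compares platforms by energy and by energy-delay product (EDP). As a minimal sketch, assuming the usual definitions (energy = average power × runtime, EDP = energy × runtime), the snippet below shows how the two metrics can rank the same pair of runs differently. The `Run` class, the `best_by` helper, and all numbers are hypothetical placeholders for illustration, not measurements or code from the paper.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass(frozen=True)
class Run:
    """One run of an application on one platform (hypothetical values)."""
    platform: str
    avg_power_watts: float
    runtime_seconds: float

    @property
    def energy_joules(self) -> float:
        # Energy consumed: average power multiplied by runtime.
        return self.avg_power_watts * self.runtime_seconds

    @property
    def edp(self) -> float:
        # Energy-delay product: energy multiplied by runtime (joule-seconds).
        return self.energy_joules * self.runtime_seconds


def best_by(runs: Iterable[Run], metric: Callable[[Run], float]) -> str:
    """Return the platform name with the lowest value of the given metric."""
    return min(runs, key=metric).platform


# Placeholder numbers chosen only to illustrate the trade-off.
runs = [
    Run("low-power core", avg_power_watts=5.0, runtime_seconds=40.0),
    Run("high-end core", avg_power_watts=60.0, runtime_seconds=4.0),
]

for run in runs:
    print(f"{run.platform}: energy={run.energy_joules} J, EDP={run.edp} J*s")

print("lowest energy:", best_by(runs, lambda r: r.energy_joules))
print("lowest EDP:", best_by(runs, lambda r: r.edp))
```

With these placeholder inputs the slower, low-power run uses less energy, while the faster run has the lower EDP because EDP penalizes runtime a second time. This is one reason a platform ranking can depend on which metric is chosen.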
Intel's Xeon Phi co-processor has the potential to provide an impressive 4 GFlops/Watt while promising users that they need only recompile their code to run it on the accelerator. This paper reports our experience running LAMMPS, a widely used molecular dynamics code, on the Xeon Phi and the steps we took to optimize its performance on the device. Using performance analysis tools to pinpoint bottlenecks in the code, we achieved a speedup of 2.8x for the optimized code on the Xeon Phi relative to the original code on the host processors. These optimizations also improved LAMMPS performance on the host, speeding up execution by 7x.
The trifecta of power, performance and programmability has spurred significant interest in the 64-bit ARMv8 platform. These new systems provide energy efficiency, a traditional CPU programming model, and the potential for high performance when enough cores are thrown at the problem. However, it remains unclear how well the ARM architecture will work as a design point for the High Performance Computing market. In this paper, we characterize and investigate the key architectural factors that impact power and performance on a current ARMv8 offering (X-Gene 1) and Intel's Sandy Bridge processor. Using Principal Component Analysis, multiple linear regression models, and variable importance analysis, we conclude that the CPU frontend has the biggest impact on performance on both the X-Gene 1 and Sandy Bridge processors.