Current and planned computer systems present challenges for scientific programming. Memory capacity and bandwidth are limiting performance as floating point capability increases due to more cores per processor and wider vector units. Effectively using hardware requires finding greater parallelism in programs while using relatively less memory. In this poster, we present how we tuned the Livermore Unstructured Lagrange Explicit Shock Hydrodynamics (LULESH) proxy application for on-node performance, resulting in 62% fewer memory reads, a 19% smaller memory footprint, 770% more floating point operations vectorizing, and a serial section that accounts for less than 0.1% of the runtime. Tests show runtime decreases of up to 57% for the serial code version and up to 75% for the parallel version. We are also applying these optimizations to GPUs and to a subset of ALE3D, from which the proxy application was derived. So far we achieve up to a 1.9x speedup on GPUs and a 13% runtime reduction in the application for the same problem.

I. INTRODUCTION

Hydrodynamics is widely used to model continuum material properties and material interactions in the presence of applied forces. It can consume up to one third of the runtime of these applications. To provide a simpler, but still full-featured, problem for testing various tuning techniques and different programming models, the Livermore Unstructured Lagrange Explicit Shock Hydro (LULESH) mini-app was created as one of five challenge problems in the DARPA UHPC program [1]. LULESH solves the Sedov problem by modeling one octant of a symmetrical blast wave.

We are using LULESH to test optimization techniques and programming practices that increase the performance of code on current and future architectures. By using a mini-app, we can quickly explore and evaluate techniques that hold promise before making the more extensive changes needed in production codes. We focus on increasing hardware parallelism utilization, reducing memory traffic, and decreasing memory footprint.
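As a minimal sketch of how reducing memory traffic and footprint can look in practice, the following contrasts two loops that communicate through a full-size temporary array with a single fused loop in which the temporary is contracted to a scalar. The kernel, the function names (`update_unfused`, `update_fused`), and the arithmetic are hypothetical illustrations and are not taken from LULESH.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical kernel, unfused form: the first loop writes a full-size
// temporary array that the second loop reads back from memory.
// Assumes p and q have the same length.
std::vector<double> update_unfused(const std::vector<double>& p,
                                   const std::vector<double>& q,
                                   double dt) {
    const std::size_t n = p.size();
    std::vector<double> tmp(n);  // n extra doubles of footprint
    std::vector<double> out(n);
    for (std::size_t i = 0; i < n; ++i) {
        tmp[i] = p[i] + q[i];
    }
    for (std::size_t i = 0; i < n; ++i) {
        out[i] = dt * tmp[i] * tmp[i];
    }
    return out;
}

// Fused form: one pass over the data (loop fusion), with the temporary
// array replaced by a per-iteration scalar (array contraction).
// Assumes p and q have the same length.
std::vector<double> update_fused(const std::vector<double>& p,
                                 const std::vector<double>& q,
                                 double dt) {
    const std::size_t n = p.size();
    std::vector<double> out(n);
    for (std::size_t i = 0; i < n; ++i) {
        const double t = p[i] + q[i];  // can stay in a register
        out[i] = dt * t * t;
    }
    return out;
}
```

Both functions compute the same result; the fused form avoids allocating the temporary and avoids writing it and reading it back, which is the kind of saving that matters when bandwidth and memory per core are the limiting resources.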
Optimizations target on-node memory bandwidth, memory footprint, and parallelism, because architectural trends are resulting in machines with less memory per core, less relative bandwidth, and more on-node parallelism.

We applied six optimizations to LULESH: loop fusion, array contraction, data layout changes, increased vectorization, NUMA-aware allocation, and allocation of temporaries outside the timestep loop. These changes reduced last-level cache misses by over 62%, the global state size of the program by 19%, the serial section to less than 0.1% of the overall runtime,