We report our experiences porting Spark to large production HPC systems. While Spark performance in a data center installation (with local disks) is dominated by the network, our results show that file system metadata access latency can dominate in an HPC installation using Lustre: it can make single-node performance up to 4× slower than that of a typical workstation. We evaluate a combination of software techniques and hardware configurations designed to address this problem. For example, on the software side we develop a file pooling layer able to improve per-node performance by up to 2.8×. On the hardware side we evaluate a system with a large NVRAM buffer between compute nodes and the backend Lustre file system: this improves scaling at the expense of per-node performance. Overall, our results indicate that scalability is currently limited to O(10²) cores in an HPC installation with Lustre and default Spark. After careful configuration combined with our pooling we can scale up to O(10⁴) cores. As our analysis indicates, much higher scalability should be attainable in the near future.
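To illustrate the file pooling idea, here is a minimal sketch (hypothetical class and method names, not the paper's implementation): a pool caches open file descriptors keyed by path, so repeated opens of the same file are served from a hash-table lookup instead of paying a metadata round-trip to the Lustre metadata server on every open/close pair.

```cpp
#include <fcntl.h>
#include <unistd.h>
#include <mutex>
#include <string>
#include <unordered_map>

// Minimal file-descriptor pool: caches open descriptors keyed by path so
// that repeated opens of the same file skip the Lustre metadata round-trip.
class FilePool {
public:
    // Return a cached descriptor, opening (and caching) it on first use.
    int acquire(const std::string& path) {
        std::lock_guard<std::mutex> lock(mu_);
        auto it = fds_.find(path);
        if (it != fds_.end()) return it->second;
        int fd = ::open(path.c_str(), O_RDONLY);
        if (fd >= 0) fds_.emplace(path, fd);
        return fd;
    }
    // Close every pooled descriptor, e.g. at task or executor shutdown.
    ~FilePool() {
        for (auto& kv : fds_) ::close(kv.second);
    }
private:
    std::mutex mu_;
    std::unordered_map<std::string, int> fds_;
};
```

Since Spark tasks tend to reopen the same shuffle and input files many times, replacing each open/close pair with a pooled lookup removes most of the metadata traffic from the critical path, which is consistent with the per-node speedup the abstract reports.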
Although concurrency may be logically available, applications often do not exploit enough instantaneous communication concurrency to maximize hardware utilization on HPC systems. This is exacerbated in hybrid programming models such as SPMD+OpenMP. We present the design of a "multi-threaded" runtime able to transparently increase instantaneous network concurrency and to provide near-saturation bandwidth, independent of the application configuration and dynamic behavior. The runtime forwards communication requests from application-level tasks to multiple communication servers. Our techniques alleviate the need for spatial and temporal application-level message concurrency optimizations. Experimental results show message throughput and bandwidth improved by as much as 150% for 4 KB messages on InfiniBand and by as much as 120% for 4 KB messages on Cray Aries. For more complex operations such as all-to-all collectives, we observe as much as 30% speedup. This translates into a 23% speedup on 12,288 cores for a NAS FT implementation using FFTW. We also observe as much as 76% speedup on 1,500 cores for an already optimized UPC+OpenMP geometric multigrid application using hybrid parallelism.
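The forwarding scheme the abstract describes can be sketched as a producer/consumer pattern (hypothetical types; the real runtime sits below UPC/SPMD+OpenMP and drives InfiniBand or Aries hardware): application threads enqueue requests, and a configurable set of server threads drains the queue, so the network sees many in-flight operations even when the application issues them from a single thread.

```cpp
#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

// A communication request handed off by an application-level task.
struct CommRequest {
    int dest;          // destination rank
    const void* buf;   // payload
    std::size_t len;   // payload size in bytes
};

// Application threads enqueue requests; several server threads drain the
// queue, raising instantaneous network concurrency beyond what the
// application issues on its own.
class CommServerPool {
public:
    explicit CommServerPool(int nservers) {
        for (int i = 0; i < nservers; ++i)
            servers_.emplace_back([this] { serve(); });
    }
    void submit(const CommRequest& r) {
        { std::lock_guard<std::mutex> lk(mu_); q_.push(r); }
        cv_.notify_one();
    }
    ~CommServerPool() {
        { std::lock_guard<std::mutex> lk(mu_); done_ = true; }
        cv_.notify_all();
        for (auto& t : servers_) t.join();
    }
private:
    void serve() {
        for (;;) {
            std::unique_lock<std::mutex> lk(mu_);
            cv_.wait(lk, [this] { return !q_.empty() || done_; });
            if (q_.empty()) return;  // done_ set and queue drained
            CommRequest r = q_.front(); q_.pop();
            lk.unlock();
            issue_send(r);  // placeholder for the network-specific send
        }
    }
    void issue_send(const CommRequest&) { /* e.g., verbs/GASNet call */ }
    std::mutex mu_;
    std::condition_variable cv_;
    std::queue<CommRequest> q_;
    std::vector<std::thread> servers_;
    bool done_ = false;
};
```

The key design point is that the number of server threads is a runtime knob decoupled from the application's thread count, which is what makes the concurrency increase transparent to the programming model.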
Exascale systems will require new approaches to performance observation, analysis, and runtime decision-making to optimize for performance and efficiency. The standard "first-person" model, in which multiple operating system processes and threads observe themselves and record first-person performance profiles or traces for offline analysis, is not adequate to observe and capture interactions at shared resources in highly concurrent, dynamic systems. Further, it does not support mechanisms for runtime adaptation. Our approach, called APEX (Autonomic Performance Environment for eXascale), provides mechanisms for sharing information among the layers of the software stack, including hardware, operating and runtime systems, and application code, both new and legacy. The performance measurement components share information across layers, merging first-person data sets with information collected by third-person tools observing shared hardware and software states at the node and global levels. Critically, APEX provides a policy engine designed to guide runtime adaptation mechanisms to make algorithmic changes, re-allocate resources, or change scheduling rules when appropriate conditions occur.
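The policy-engine concept can be illustrated with a toy sketch (this is not the actual APEX API; all names here are hypothetical): a policy pairs a condition over observed performance events with an adaptation action, and the engine evaluates registered policies as events arrive from first- and third-person measurement sources.

```cpp
#include <functional>
#include <string>
#include <vector>

// Hypothetical event carrying an observation (e.g., node-level power or
// shared-resource contention) merged from first- and third-person sources.
struct PerfEvent {
    std::string metric;  // e.g. "node.power_watts"
    double value;
};

// A toy policy engine in the spirit of APEX: policies are predicates paired
// with adaptation actions, evaluated as events arrive at runtime.
class PolicyEngine {
public:
    using Predicate = std::function<bool(const PerfEvent&)>;
    using Action    = std::function<void()>;
    void register_policy(Predicate when, Action then) {
        policies_.push_back({std::move(when), std::move(then)});
    }
    void on_event(const PerfEvent& e) {
        for (auto& p : policies_)
            if (p.when(e)) p.then();  // adapt: reschedule, resize pools, ...
    }
private:
    struct Policy { Predicate when; Action then; };
    std::vector<Policy> policies_;
};

// Usage sketch: throttle concurrency when node power exceeds a cap.
// engine.register_policy(
//     [](const PerfEvent& e) {
//         return e.metric == "node.power_watts" && e.value > 90.0;
//     },
//     [] { /* reduce thread count or change scheduling rules */ });
```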
Producing high-performance implementations from simple, portable computation specifications is a challenge that compilers have tried to address for several decades. More recently, a relatively stable architectural landscape has evolved into a set of increasingly diverging and rapidly changing CPU and accelerator designs, with the main common factor being dramatic increases in the levels of parallelism available. The growth of architectural heterogeneity and parallelism, combined with the very slow development cycles of traditional compilers, has motivated the development of autotuning tools that can quickly respond to changes in architectures and programming models, and enable very specialized optimizations that are unlikely to be provided by mainstream compilers. In this paper we describe the new OpenCL code generator and autotuner OrCL and the introduction of detailed performance measurement into the autotuning process. OrCL is implemented within the Orio autotuning framework, which enables the rapid development of experimental languages and code optimization strategies aimed at achieving good performance on new platforms without rewriting or hand-optimizing critical kernels. The combination of the new OpenCL autotuning and TAU measurement capabilities enables users to consistently evaluate autotuning effectiveness across a range of architectures, including several NVIDIA and AMD accelerators and Intel Xeon Phi processors, and to compare the OpenCL and CUDA code generation capabilities. We present results of autotuning several numerical kernels that typically dominate the execution time of iterative sparse linear system solution and key computations from a 3-D parallel simulation of solid fuel ignition.
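The empirical search loop common to autotuners such as Orio can be sketched generically (hypothetical names; OrCL additionally generates the OpenCL variants and collects detailed measurements through TAU): enumerate parameterized variants of a kernel, time each on the target device, and keep the fastest.

```cpp
#include <chrono>
#include <functional>
#include <limits>
#include <vector>

// One tunable variant of a kernel: a parameter setting (e.g., work-group
// size or unroll factor) plus a runnable realization of that kernel.
struct Variant {
    int workgroup_size;          // example tuning parameter
    std::function<void()> run;   // generated/compiled kernel under test
};

// Empirical autotuning loop: time every candidate and return the fastest.
// Real autotuners also prune the search space, discard warm-up runs, and
// repeat measurements to reduce timing noise. Assumes a non-empty list.
Variant tune(const std::vector<Variant>& candidates, int repeats = 5) {
    using clock = std::chrono::steady_clock;
    double best_time = std::numeric_limits<double>::max();
    Variant best = candidates.front();
    for (const auto& v : candidates) {
        auto start = clock::now();
        for (int i = 0; i < repeats; ++i) v.run();
        double t = std::chrono::duration<double>(clock::now() - start).count();
        if (t < best_time) { best_time = t; best = v; }
    }
    return best;
}
```

Because the selection is driven by measured time rather than a compiler cost model, the same loop adapts automatically when the target moves from an NVIDIA or AMD accelerator to a Xeon Phi, which is the portability argument the abstract makes.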