2016
DOI: 10.1007/978-3-319-46079-6_17

High Performance Computing on the IBM Power8 Platform

Cited by 6 publications (3 citation statements)
References 4 publications
“…This is only about 4% and 3.2% of the DGEMM peak of the two processors on each system (623.19 GFlop/s and 482.53 GFlop/s respectively). For the two-socket POWER8, the most time-consuming kernel achieved about 52 GFlop/s, which is about 10% of peak (501 GFlop/s [51]). Such low achieved performance relative to peak is even more pronounced on the P100 and V100 GPUs, where less than 3% of peak (4.7 TFlop/s and 7 TFlop/s respectively) was achieved on either GPU for the most time-consuming kernel.…”
Section: Computation and Bandwidth Performance (mentioning)
confidence: 99%
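
The fraction-of-peak arithmetic quoted above is easy to reproduce. The sketch below (Python) is illustrative only: it reuses the figures from the excerpt (52 GFlop/s sustained against a 501 GFlop/s two-socket POWER8 DGEMM peak), and the helper name percent_of_peak is ours, not from either paper.

    # Minimal sketch of the fraction-of-peak arithmetic in the excerpt above.
    # Figures come from the quoted text; the helper name is illustrative.
    def percent_of_peak(sustained_gflops, peak_gflops):
        """Sustained performance as a percentage of machine peak."""
        return 100.0 * sustained_gflops / peak_gflops

    # Two-socket POWER8: ~52 GFlop/s sustained vs. a 501 GFlop/s DGEMM peak.
    print(f"POWER8 kernel: {percent_of_peak(52.0, 501.0):.1f}% of peak")  # ~10.4%

    # V100: less than 3% of a 7 TFlop/s peak is under ~210 GFlop/s sustained.
    print(f"V100 upper bound: {0.03 * 7000.0:.0f} GFlop/s")
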
“…[19] compare OpenMP 4.5 with Cray to OpenACC on Nekbone; however, that analysis is also restricted to runtimes, and the focus is more on programmability. We are not aware of academic papers studying the performance of CUDA Fortran or OpenMP 4 in the IBM XL compilers aside from early results in our own previous work [20]. There is also very little work comparing the performance of CUDA code compiled with nvcc and clang.…”
Section: Related Work (mentioning)
confidence: 99%
“…The HPC system software for alternative platforms is still under development; for example, the first math libraries for ARM-based servers were released three years ago [93]. Similar studies confirm that the system software stack on alternative platforms is relatively immature, which limits the achievable performance [88,94,95]. Finally, ThunderX shows very low FLOP/s and memory-bandwidth utilization of 23% and 27%, respectively.…”
Section: Theoretical vs. Sustained Flops/s and Memory Bandwidth (mentioning)
confidence: 97%
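
The utilization figures in this last excerpt follow the same pattern: sustained FLOP/s and memory bandwidth expressed as a share of the theoretical maxima. The sketch below is a hypothetical illustration; the socket and core counts, clock, FLOPs per cycle, and measured values are placeholders we chose, not numbers from the cited work.

    # Hypothetical illustration of theoretical vs. sustained utilization.
    # All machine parameters and measurements are placeholders, not data
    # from the cited papers.
    def peak_gflops(sockets, cores, clock_ghz, flops_per_cycle):
        """Theoretical double-precision peak in GFlop/s."""
        return sockets * cores * clock_ghz * flops_per_cycle

    def utilization(sustained, theoretical):
        """Sustained value as a percentage of the theoretical maximum."""
        return 100.0 * sustained / theoretical

    peak = peak_gflops(sockets=2, cores=48, clock_ghz=2.0, flops_per_cycle=8)
    print(f"Theoretical peak:      {peak:.0f} GFlop/s")
    print(f"FLOP/s utilization:    {utilization(350.0, peak):.0f}%")
    print(f"Bandwidth utilization: {utilization(40.0, 150.0):.0f}%  (GB/s)")
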