Memory Performance of AMD EPYC Rome and Intel Cascade Lake SP Server Processors

Velten, Markus; Schöne, Robert; Ilsche, Thomas; Hackenberg, Daniel

doi:10.1145/3489525.3511689

Cited by 16 publications

(3 citation statements)

References 20 publications

“…Further complicating matters, shared cache bandwidth is a complex topic since it is often tied to core and fabric clockspeeds. These and other aspects are studied in further detail by [56,57]. Once again, this is not the whole picture.…”

Section: Fair CPU and GPU Comparisons (mentioning)

confidence: 99%

A GPU-Accelerated Particle Advection Methodology for 3D Lagrangian Coherent Structures in High-Speed Turbulent Boundary Layers

Lagares

Araya

2023

Energies

In this work, we introduce a scalable and efficient GPU-accelerated methodology for volumetric particle advection and finite-time Lyapunov exponent (FTLE) calculation, focusing on the analysis of Lagrangian coherent structures (LCS) in large-scale direct numerical simulation (DNS) datasets across incompressible, supersonic, and hypersonic flow regimes. LCS play a significant role in turbulent boundary layer analysis, and our proposed methodology offers valuable insights into their behavior in various flow conditions. Our novel owning-cell locator method enables efficient constant-time cell search, and the algorithm draws inspiration from classical search algorithms and modern multi-level approaches in numerical linear algebra. The proposed method is implemented for both multi-core CPUs and Nvidia GPUs, demonstrating strong scaling up to 32,768 CPU cores and up to 62 Nvidia V100 GPUs. By decoupling particle advection from other problems, we achieve modularity and extensibility, resulting in consistent parallel efficiency across different architectures. Our methodology was applied to calculate and visualize the FTLE on four turbulent boundary layers at different Reynolds and Mach numbers, revealing that coherent structures grow more isotropic proportional to the Mach number, and their inclination angle varies along the streamwise direction. We also observed increased anisotropy and FTLE organization at lower Reynolds numbers, with structures retaining coherency along both spanwise and streamwise directions. Additionally, we demonstrated the impact of lower temporal frequency sampling by upscaling with an efficient linear upsampler, preserving general trends with only 10% of the required storage. In summary, we present a particle search scheme for particle advection workloads in the context of visualizing LCS via FTLE that exhibits strong scaling performance and efficiency at scale. 
Our proposed algorithm is applicable across various domains, requiring efficient search algorithms in large, structured domains. While this article focuses on the methodology and its application to LCS, an in-depth study of the physics and compressibility effects in LCS candidates will be explored in a future publication.

“…Additionally, the TLB latency overlaps with the latency for the L1 cache as the cache is virtually indexed and physically tagged, which requires a TLB lookup in parallel. Velten et al. [VSIH22] benchmarked the AMD EPYC 7702, which uses a 7nm process, and the Intel Xeon Gold 6248, which is manufactured in 14nm. The results show that the latency for the L1 cache is between 1.6ns and 2ns.…”

Section: Hardware Requirements (mentioning)

confidence: 99%

Risky Translations: Securing TLBs against Timing Side Channels

Stolz

Thoma

Sasdrich

et al. 2022

TCHES

Microarchitectural side-channel vulnerabilities in modern processors are known to be a powerful attack vector that can be utilized to bypass common security boundaries like memory isolation. As shown by recent variants of transient execution attacks related to Spectre and Meltdown, those side channels allow to leak data from the microarchitecture to the observable architectural state. The vast majority of attacks currently build on the cache-timing side channel, since it is easy to exploit and provides a reliable, fine-grained communication channel. Therefore, many proposals for side-channel secure cache architectures have been made. However, caches are not the only source of side-channel leakage in modern processors and mitigating the cache side channel will inevitably lead to attacks exploiting other side channels. In this work, we focus on defeating side-channel attacks based on page translations. It has been shown that the Translation Lookaside Buffer (TLB) can be exploited in a very similar fashion to caches. Since the main caches and the TLB share many features in their architectural design, the question arises whether existing countermeasures against cache-timing attacks can be used to secure the TLB. We analyze state-of-the-art proposals for side-channel secure cache architectures and investigate their applicability to TLB side channels. We find that those cache countermeasures are not directly applicable to TLBs, and propose TLBcoat, a new side-channel secure TLB architecture. We provide evidence of TLB side-channel leakage on RISC-V-based Linux systems, and demonstrate that TLBcoat prevents this leakage. We implement TLBcoat using the gem5 simulator and evaluate its performance using the PARSEC benchmark suite.

“…5 in [28] the size of shared memory) of the SM. Although GPUs provide high memory bandwidth, the global memory access latency is also higher than that of CPUs [32,33,34]. Therefore, optimal throughput may be attained by covering memory requests with computational execution and hiding the latency of data movement.…”

Section: GPU Architecture (mentioning)

confidence: 99%

Advancing the state of the art of directive-based programming for GPUs: runtime and compilation

Matsumura

(English) The rapid development in computing technology has paved the way for directive-based programming models towards a principal role in maintaining software portability of performance-critical applications. Efforts on such models involve a least engineering cost for enabling computational acceleration on multiple architectures, while programmers are only required to add meta information upon sequential code. Optimizations for obtaining the best possible efficiency, however, are often challenging. The insertions of directives by the programmer can lead to side-effects that limit the available compiler optimization possible, which could result in performance degradation. This is exacerbated when targeting asynchronous execution or multi-GPU systems, as pragmas do not automatically adapt to such mechanisms, and require expensive and time consuming code adjustment by programmers. Moreover, directive-based programming models such as OpenACC and OpenMP often prevent programmers from making additional optimizations to take advantage of the advanced architectural features of GPUs because the actual generated computation is hidden from the application developer. This dissertation explores new possibilities for optimizing directive-based code from both runtime and compilation perspectives. First, we introduce a runtime framework for OpenACC to facilitate dynamic analysis and compilation. Especially, our framework realizes automatic asynchronous execution and multi-GPU use based on the status of kernel execution and data availability while taking advantage of an on-the-fly mechanism for compilation and program optimization. We add a versatile code-translation method for multi-device utilization by which manually-optimized applications can be distributed automatically while keeping original code structure and parallelism. 
Second, we implement a novel flexible optimization technique that operates by inserting a code emulator phase to the tail-end of the compilation pipeline. Our tool emulates the generated code using symbolic analysis by substituting dynamic information and thus allowing for further low-level code optimizations to be applied. We implement our tool to support both CUDA and OpenACC directives as the frontend of the compilation pipeline, thus enabling low-level GPU optimizations for OpenACC that were not previously possible. Third, we propose the use of a modern optimization technique, equality saturation, to optimize sequential code utilized in directive-based programming for GPUs. Our approach realizes less computation, less memory access, and high memory throughput simultaneously. Our fully-automated framework constructs single-assignment forms from inputs to be entirely rewritten while keeping dependencies and extracts optimal cases. Overall, we cover runtime techniques and optimization methods based on dynamic information, low-level operations, and user-level opportunities. We evaluate our proposals on the state-of-the-art GPUs and provide detailed analysis for each technique. For multi-GPU use, we show in some cases nearly linear scaling on the part of kernel execution with the NVIDIA V100 GPUs. While adaptively using multi-GPUs, the resulting performance improvements amortize the latency of GPU-to-GPU communications. Regarding low-level optimization, we demonstrate the capabilities of our tool by automating warp-level shuffle instructions that are difficult to use by even advanced GPU programmers. While evaluating our tool with a benchmark suite and complex application code, we provide a detailed study to assess the benefits of shuffle instructions across four generations of GPU architectures. Lastly, with sequential code optimization, we demonstrate a significant performance improvement on several compilers through practical benchmarks. 
Then, we highlight the advantages of computational reordering and emphasize the significance of memory-access order for modern GPUs.
