2019
DOI: 10.1002/cpe.5547

Hierarchical Roofline analysis for GPUs: Accelerating performance optimization for the NERSC‐9 Perlmutter system

Abstract: The Roofline performance model provides an intuitive and insightful approach to identifying performance bottlenecks and guiding performance optimization. In preparation for the next‐generation supercomputer Perlmutter at NERSC, this paper presents a methodology to construct a hierarchical Roofline on NVIDIA GPUs and extends it to support reduced precision and Tensor Cores. The hierarchical Roofline incorporates L1, L2, device memory, and system memory bandwidths into one single figure, and it offers mo…
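The hierarchical Roofline described above bounds attainable performance at each level of the memory hierarchy by min(peak compute, arithmetic intensity × level bandwidth). A minimal sketch of that calculation in Python follows; the peak and bandwidth figures are illustrative placeholders, not measured Perlmutter or V100 values.

# Hierarchical Roofline sketch: the ceiling at each memory level is
# min(peak compute, arithmetic intensity * level bandwidth).
# All peak/bandwidth numbers below are hypothetical, for illustration only.

PEAK_GFLOPS = 7000.0  # hypothetical double-precision compute roof (GFLOP/s)

BANDWIDTHS_GBS = {    # hypothetical bandwidths per memory level (GB/s)
    "L1": 14000.0,
    "L2": 2500.0,
    "HBM": 900.0,
    "SysMem": 80.0,
}

def ceiling(ai, bandwidth, peak=PEAK_GFLOPS):
    """Roofline ceiling: the lower of the compute roof and the bandwidth roof."""
    return min(peak, ai * bandwidth)

def hierarchical_roofline(flops, bytes_per_level):
    """Arithmetic intensity and attainable GFLOP/s at each memory level."""
    out = {}
    for level, nbytes in bytes_per_level.items():
        ai = flops / nbytes                       # FLOPs per byte moved at this level
        out[level] = (ai, ceiling(ai, BANDWIDTHS_GBS[level]))
    return out

# Example: a kernel performing 1e12 FLOPs with different traffic at each level
for level, (ai, gflops) in hierarchical_roofline(
        1e12, {"L1": 4e11, "L2": 1e11, "HBM": 5e10, "SysMem": 1e10}).items():
    print(f"{level:7s} AI={ai:6.1f} FLOP/byte  ceiling={gflops:7.0f} GFLOP/s")

Plotting these per-level ceilings together against a kernel's measured performance is what collapses the whole hierarchy into the single figure the abstract mentions.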

Cited by 60 publications (59 citation statements)
References 9 publications
“…Finally, to gain a more detailed understanding of arithmetic intensity and its effects on the performance of the kernel, Figure 7 shows a "roofline" plot of the performance of the main reconstruction kernel for varying orders of accuracy with and without WENO limiting, using 200 × 200 × 100 = 4 million cells at 3rd-order temporal accuracy. The Roofline Model [33] is a technique for evaluating how well a particular operation is performing relative to the two main system limiters, bandwidth and computation. As the computational intensity of an operation increases, the operation moves up the bandwidth line until it becomes limited by the FLOPs line.…”
Section: 12 (mentioning)
confidence: 99%
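The single-level Roofline this statement describes can be written as P(AI) = min(P_peak, AI × BW): a kernel moves up the bandwidth line as its arithmetic intensity grows, until AI passes the ridge point P_peak / BW and the FLOPs line takes over. A small illustrative check in Python, with roof values chosen as placeholders rather than figures from the cited work:

# Single-level Roofline: attainable = min(peak, AI * bandwidth).
# The ridge point peak/bandwidth separates memory-bound from compute-bound kernels.
# Roof values are hypothetical, for illustration only.

PEAK_GFLOPS = 7000.0   # hypothetical compute roof (GFLOP/s)
BW_GBS = 900.0         # hypothetical memory bandwidth roof (GB/s)

def classify(ai):
    ridge = PEAK_GFLOPS / BW_GBS                 # AI where the two roofs meet
    attainable = min(PEAK_GFLOPS, ai * BW_GBS)
    bound = "memory-bound" if ai < ridge else "compute-bound"
    return f"AI={ai:5.1f} FLOP/byte -> ceiling {attainable:6.0f} GFLOP/s ({bound})"

# e.g. raising the order of accuracy of a reconstruction kernel raises its AI
for ai in (0.5, 2.0, 10.0, 50.0):
    print(classify(ai))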
“…Over the years, the Classical Roofline model [36] has been formulated for multicore [19,23] and GPU [18,40] architectures. Moreover, assisted methodologies and automatic tools [5,22,28,29] have been introduced to ease Roofline model generation for scientific and HPC application optimization.…”
Section: Related Work (mentioning)
confidence: 99%
“…Condensing the optimization space into a single performance figure, this model provides intuitive guidance for optimizing complex applications. In this way, the Roofline model has become an established methodology for optimizing HPC applications targeting multicore [19,23] and GPU [18,40] architectures. With Field-Programmable Gate Array (FPGA) devices becoming an appealing solution for accelerating HPC applications, a dual Roofline model for reconfigurable devices is of growing interest.…”
Section: Introduction (mentioning)
confidence: 99%
“…Today, nearly half of all the FLOPs in the Top 500 supercomputers come from GPUs rather than CPUs, and that proportion continues to grow. In preparation for Perlmutter—the National Energy Research Scientific Computing Center's (NERSC's) upcoming NVIDIA GPU‐powered supercomputer—Yang, Kurth, and Williams present a methodology for constructing a hierarchical Roofline model for NVIDIA GPUs. The model supports reduced precision and Tensor Cores, and the authors demonstrate its effectiveness in providing an understanding of performance bottlenecks in three proxy applications—GPP from BerkeleyGW, HPGMG from AMReX, and conv2d from TensorFlow.…”
Section: Themes of This Special Issue (mentioning)
confidence: 99%