Optimization of geometric multigrid for emerging multi- and manycore processors

Williams, Samuel; Kalamkar, Dhiraj D.; Singh, Amik; Deshpande, Ashok; Van Straalen, Brian; Smelyanskiy, Mikhail; Almgren, Ann S.; Dubey, Pradeep; Shalf, John; Oliker, Leonid

doi:10.1109/sc.2012.85

Cited by 56 publications

(67 citation statements)

References 19 publications


“…We then present application results for the following important stencil-based applications: fluid animation from the PARSEC benchmark suite [3], geometric multi-grid calculations (GMG) [40], seismic wave propagation simulation (RTM) [23], the SOBEL filter used extensively for image processing [10], and a collection of Laplacian stencil kernels [18]. For the application results, we model Intel Phi co-processors.…”

Section: Evaluation 41 Methodology (mentioning)

confidence: 99%

Collective memory transfers for multi-core chips

Williams

Shalf

2014

Proceedings of the 28th ACM International Conference on Supercomputing

Self Cite


Future performance improvements for microprocessors have shifted from clock frequency scaling towards increases in on-chip parallelism. Performance improvements for a wide variety of parallel applications require domain-decomposition of data arrays from a contiguous arrangement in memory to a tiled layout for on-chip L1 data caches and scratchpads. However, DRAM performance suffers under the non-streaming access patterns generated by many independent cores. We propose collective memory scheduling (CMS) that actively takes control of collective memory transfers such that requests arrive in a sequential and predictable fashion to the memory controller. CMS uses the hierarchically tiled arrays formalism to compactly express collective operations, which greatly improves programmability over conventional prefetch or list-DMA approaches. CMS reduces application execution time by up to 32% and DRAM read power by 2.2×, compared to a baseline DMA architecture such as STI Cell.


“…We then present application results for the following important stencil-based applications: fluid animation from the PARSEC benchmark suite [6], geometric multi-grid calculations (GMG) [73], seismic wave propagation simulation (RTM) [46], the SOBEL filter used extensively for image processing [23], and a collection of Laplacian stencil kernels [35]. For the application results, we model Intel Phi co-processors, which are simple x86-based processors and representative of the simple cores projected for future many-core chips [9,68].…”

Section: Methods (mentioning)

confidence: 99%

Collective Memory Transfers for Multi-Core Chips

Williams

Shalf

2013

Self Cite


“…Thus, each CUDA thread computes 64 output grid points (in the k dimension), and each thread block computes 32(TX)*16(TY)*64 = 32,768 output points. The optimized smooth in miniGMG uses a 3D grid of dimension {BX=2, BY=16, BZ=64}, and 2D thread blocks {TX=32, TY=4} [32].…”

Section: Parallel Decomposition Of GMG (mentioning)

confidence: 99%
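
The arithmetic in the quoted statement above can be checked with a short sketch. The helper names `points_per_block` and `total_threads` are hypothetical and do not come from miniGMG or the citing paper; only the TX, TY, BX, BY, BZ values and the 64 points per thread are taken from the quote. The statement does not give the size of the box being smoothed, so the second figure is only a count of launched threads.

```python
from math import prod


def points_per_block(tx, ty, points_per_thread):
    """Output points computed by one thread block of tx * ty threads."""
    return tx * ty * points_per_thread


def total_threads(grid_dims, block_dims):
    """Threads launched by a grid of blocks with block_dims threads each."""
    return prod(grid_dims) * prod(block_dims)


# First configuration in the quote: TX=32, TY=16, 64 points per thread along k.
generated = points_per_block(tx=32, ty=16, points_per_thread=64)

# Second configuration in the quote (miniGMG's optimized smooth):
# grid {BX=2, BY=16, BZ=64} with thread blocks {TX=32, TY=4}.
minigmg_threads = total_threads(grid_dims=(2, 16, 64), block_dims=(32, 4))

print(generated)        # 32768, matching the 32,768 in the quote
print(minigmg_threads)  # 262144
```

The second count equals 64*64*64, which would cover a 64-cubed box if each thread wrote a single output point; that mapping is an assumption, not something the quoted statement says.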

Compiler-based code generation and autotuning for geometric multigrid on GPU-accelerated supercomputers

et al. 2017

Self Cite

View full text Add to dashboard Cite

GPUs, with their high bandwidths and computational capabilities, are an increasingly popular target for scientific computing. Unfortunately, to date, harnessing the power of the GPU has required use of a GPU-specific programming model like CUDA, OpenCL, or OpenACC. As such, in order to deliver portability across CPU-based and GPU-accelerated supercomputers, programmers are forced to write and maintain two versions of their applications or frameworks. In this paper, we explore the use of a compiler-based autotuning framework based on CUDA-CHiLL to deliver not only portability, but also performance portability across CPU- and GPU-accelerated platforms for the geometric multigrid linear solvers found in many scientific applications. We show that with autotuning we can attain near Roofline (a performance bound for a computation and target architecture) performance across the key operations in the miniGMG benchmark for both CPU- and GPU-based architectures as well as for multiple stencil discretizations and smoothers. We show that our technology is readily interoperable with MPI, resulting in performance at scale equal to that obtained via hand-optimized MPI+CUDA implementation.
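
The Roofline bound mentioned in the abstract above is commonly written as the minimum of peak compute throughput and the product of memory bandwidth and arithmetic intensity. The sketch below illustrates that standard form only; the function name `roofline_gflops` and all machine numbers are hypothetical and are not taken from the paper.

```python
def roofline_gflops(peak_gflops, bandwidth_gbs, arithmetic_intensity):
    """Attainable GFLOP/s under the basic Roofline model.

    arithmetic_intensity is in flops per byte of memory traffic.
    """
    return min(peak_gflops, bandwidth_gbs * arithmetic_intensity)


# Hypothetical machine: 1000 GFLOP/s peak compute, 200 GB/s memory bandwidth.
low_intensity = roofline_gflops(1000.0, 200.0, 0.5)    # bandwidth-bound kernel
high_intensity = roofline_gflops(1000.0, 200.0, 10.0)  # compute-bound kernel

print(low_intensity)   # 100.0
print(high_intensity)  # 1000.0
```

Stencil smoothers typically have low arithmetic intensity, so on a model like this their bound is set by the bandwidth term rather than by peak compute.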
