Auto-tuning 3-D FFT library for CUDA GPUs

This paper develops and evaluates search and optimization techniques for auto-tuning 3D stencil (nearest-neighbor) computations on GPUs. Observations indicate that parameter tuning is necessary for heterogeneous GPUs to achieve optimal performance with respect to a search space. Our proposed framework takes a most concise specification of stencil behavior from the user as a single formula, auto-generates tunable code from it, systematically searches for the best configuration and generates the code with optimal parameter configurations for different GPUs. This auto-tuning approach guarantees adaptive performance for different generations of GPUs while greatly enhancing programmer productivity. Experimental results show that the delivered floating point performance is very close to previous handcrafted work and outperforms other auto-tuned stencil codes by a large margin.
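The "systematically searches for the best configuration" step in the abstract above can be illustrated with a minimal sketch of an exhaustive search. The parameter names (`block_x`, `block_y`, `unroll`) and the `toy_cost` function are hypothetical stand-ins, not taken from the paper; a real tuner would presumably time generated kernels on each target GPU instead of calling a formula.

```python
import itertools


def autotune(search_space, measure):
    """Evaluate every configuration and return the cheapest one.

    search_space: dict mapping a parameter name to its candidate values.
    measure: callable taking a config dict and returning a cost (lower is better).
    """
    names = list(search_space)
    best_config, best_cost = None, float("inf")
    for values in itertools.product(*(search_space[n] for n in names)):
        config = dict(zip(names, values))
        cost = measure(config)
        if cost < best_cost:  # strict '<' keeps the first of any tied configs
            best_config, best_cost = config, cost
    return best_config, best_cost


def toy_cost(config):
    """Hypothetical cost model standing in for a measured kernel runtime."""
    threads = config["block_x"] * config["block_y"]
    if threads > 1024:  # assumed per-block thread limit
        return float("inf")
    return abs(threads - 256) + 10 / config["unroll"]


# Assumed search space; the paper's actual tunable parameters are not listed here.
space = {"block_x": [16, 32, 64], "block_y": [2, 4, 8, 16], "unroll": [1, 2, 4]}
best, cost = autotune(space, toy_cost)
print(best, cost)
```

Exhaustive search is only practical for small spaces like this one; it is shown here because it is the simplest instance of empirical tuning, not because the paper is known to use it.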


“…Several CUDA implementations for linear algebra subroutines and FFTs with auto-tuning capability already exist [7,12,19].…”

Section: Related Work (mentioning)

confidence: 99%

Auto-generation and auto-tuning of 3D stencil codes on GPU clusters

Zhang

Mueller

2012

Proceedings of the Tenth International Symposium on Code Generation and Optimization

106


“…Relative to this work, our contribution is to show how to do fast predictive auto-tuning that satisfies the requirements to: (a) handle the variety of recent multicore architectures like GPUs [Schaa and Kaeli, 2009], (b) provide high-performance domain-specific libraries [Nukada and Matsuoka, 2009, Li et al, 2009, Kamil et al, 2010], (c) that select good implementations at run-time [Klöckner et al, 2011, Pinto and Cox, 2012], and (d) for the full input domain of a library routine [Liu et al, 2009, Grauer-Gray and.…”

Section: Auto-tuning (mentioning)

confidence: 99%

Machine learning for predictive auto-tuning with boosted regression trees

Bergstra

Pinto

Cox

2012

2012 Innovative Parallel Computing (InPar)


The rapidly evolving landscape of multicore architectures makes the construction of efficient libraries a daunting task. A family of methods known collectively as "auto-tuning" has emerged to address this challenge. Two major approaches to auto-tuning are empirical and model-based: empirical auto-tuning is a generic but slow approach that works by measuring runtimes of candidate implementations, whereas model-based auto-tuning predicts those runtimes using simplified abstractions designed by hand. We show that machine learning methods for non-linear regression can be used to estimate timing models from data, capturing the best of both approaches. A statistically derived model offers the speed of a model-based approach, with the generality and simplicity of empirical auto-tuning. We validate our approach using the filterbank correlation kernel described in Pinto and Cox [2012], where we find that 0.1 seconds of hill climbing on the regression model ("predictive auto-tuning") can achieve almost the same speed-up as is brought by minutes of empirical auto-tuning. Our approach is not specific to filterbank correlation, nor even to GPU kernel auto-tuning, and can be applied to almost any templated-code optimization problem, spanning a wide variety of problem types, kernel types, and platforms.
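The idea of hill climbing on a learned timing model, rather than on measured runtimes, can be sketched as follows. This is an illustration under stated assumptions, not the authors' method: the abstract names boosted regression trees, while this sketch swaps in a simple inverse-distance-weighted interpolator so that it needs only the standard library. The `true_runtime` function and the 10 x 10 configuration grid are hypothetical.

```python
import random


def true_runtime(cfg):
    """Hypothetical runtime of a kernel configuration (measured on hardware in practice)."""
    bx, by = cfg
    return (bx - 5) ** 2 + (by - 3) ** 2 + 1.0


def fit_model(samples):
    """Build a non-linear predictor from (config, runtime) pairs.

    Uses inverse-distance weighting as a stand-in for boosted regression trees.
    """
    def predict(cfg):
        num = den = 0.0
        for other, runtime in samples:
            d2 = sum((a - b) ** 2 for a, b in zip(cfg, other))
            if d2 == 0:
                return runtime  # exact match with a measured sample
            num += runtime / d2
            den += 1.0 / d2
        return num / den
    return predict


def hill_climb(predict, start, lo, hi):
    """Greedy descent on the model: move to the best neighbor until none improves."""
    current = start
    while True:
        neighbors = [
            (current[0] + dx, current[1] + dy)
            for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1))
            if lo <= current[0] + dx <= hi and lo <= current[1] + dy <= hi
        ]
        best = min(neighbors, key=predict)
        if predict(best) >= predict(current):
            return current
        current = best


rng = random.Random(0)
grid = [(x, y) for x in range(10) for y in range(10)]
# Measure only 30 of the 100 configurations, then search on the fitted model.
samples = [(cfg, true_runtime(cfg)) for cfg in rng.sample(grid, 30)]
model = fit_model(samples)
start = (0, 0)
found = hill_climb(model, start, 0, 9)
print(found, model(found), true_runtime(found))
```

The search itself never calls `true_runtime`, which is the point of the predictive approach: the expensive measurements happen once, up front, and the climb runs on cheap model evaluations. Greedy descent can stop at a local minimum of the model, so the configuration it returns is not guaranteed to be the true optimum.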


“…As shown in [16] for example, it can be far more beneficial to recompute large segments of constant values instead of fetching them from main memory. Others [8] show that, in some cases, the most direct algorithm can outperform the CPU optimized one. Another source of performance loss is thread divergence due to asymmetrical branching in control flow.…”

Section: MPI Parallelism (mentioning)

confidence: 99%

Spherical Harmonic Transform with GPUs

Hupca

Falcou

Grigori

et al. 2012

Euro-Par 2011: Parallel Processing Workshops


Abstract. We describe an algorithm for computing an inverse spherical harmonic transform suitable for graphics processing units (GPUs). We use CUDA and base our implementation on a Fortran90 routine included in a publicly available parallel package, s2hat. We focus our attention on the two major sequential steps involved in the transform computation, retaining the efficient parallel framework of the original code. We detail optimization techniques used to enhance the performance of the CUDA-based code and contrast them with those implemented in the Fortran90 version. We also present performance comparisons of a single CPU plus GPU unit with the s2hat code running on either a single or 4 processors. In particular we find that use of the latest generation of GPUs, such as NVIDIA GF100 (Fermi), can accelerate the spherical harmonic transforms by as much as 18 times with respect to s2hat executed on one core, and by as much as 5.5 times with respect to s2hat on 4 cores, with the overall performance being limited by the fast Fourier transforms. The work presented here has been performed in the context of the Cosmic Microwave Background simulations and analysis. However, we expect that the developed software will be of more general interest and applicability.

1. Introduction. Spherical harmonic transforms are ubiquitous in diverse areas of science and practical applications that need to deal with data distributed on a sphere. In particular, they are heavily used in various areas of cosmology, such as studies of the cosmic microwave background (CMB) radiation and its anisotropies, which have been our main motivation for this work. The CMB is electromagnetic radiation left over after the hot and very dense stage of the early evolution of our Universe. The CMB measurements allow us to look back directly at the Universe when its age was only a small fraction (∼3%) of its current one (∼13 Gyr), and indirectly to learn about its status as far back as ∼10^-35 s after its nominal beginning (the so-called Big Bang). Not surprisingly, the CMB measurements play a vital role in present-day cosmology and have been a driving force behind turning it into the high-precision, data-driven science it is today.

The CMB radiation is nearly isotropic, but minute deviations, on the order of 1 part in 10^5, were first theoretically predicted and later detected. These so-called anisotropies encode information about the Universe, its past and composition, and their detection and characterization has been the major target of CMB observations since the moment of its discovery in 1965. Over time, progressively more sophisticated and advanced observational apparatus has been designed and deployed in search of their more subtle and telltale characteristics. These include three major CMB satellites (the American Cosmic Microwave Background Explorer (COBE) [13] and Wilkinson Microwave Anisotropy Probe (WMAP) [2], and the European Planck) and a few dozen ground-based and balloon-borne projects. Some of these are operating at this time,


Auto-tuning 3-D FFT library for CUDA GPUs

Cited by 106 publications

References 9 publications
