Joel Falcou scite author profile

Abstract. We describe an algorithm for computing an inverse spherical harmonic transform suitable for graphics processing units (GPUs). We use CUDA and base our implementation on a Fortran 90 routine included in a publicly available parallel package, S2HAT. We focus our attention on the two major sequential steps involved in the transform computation, retaining the efficient parallel framework of the original code. We detail optimization techniques used to enhance the performance of the CUDA-based code and contrast them with those implemented in the Fortran 90 version. We also present performance comparisons of a single CPU plus GPU unit with the S2HAT code running on either a single processor or 4 processors. In particular, we find that use of the latest generation of GPUs, such as NVIDIA GF100 (Fermi), can accelerate the spherical harmonic transforms by as much as 18 times with respect to S2HAT executed on one core, and by as much as 5.5 times with respect to S2HAT on 4 cores, with the overall performance being limited by the fast Fourier transforms.

The work presented here has been performed in the context of Cosmic Microwave Background simulations and analysis. However, we expect that the developed software will be of more general interest and applicability.

1. Introduction. Spherical harmonic transforms are ubiquitous in diverse areas of science and practical applications that need to deal with data distributed on a sphere. In particular, they are heavily used in various areas of cosmology, such as studies of the cosmic microwave background (CMB) radiation and its anisotropies, which have been our main motivation for this work. The CMB is electromagnetic radiation left over from the hot and very dense stage of the early evolution of our Universe. The CMB measurements allow us to look back directly at the Universe when its age was only a small fraction (∼ 3%) of its current one (∼ 13 Gyr), and indirectly to learn about its state as far back as ∼ 10^-35 s after its nominal beginning (the so-called Big Bang). Not surprisingly, the CMB measurements play a vital role in present-day cosmology and have been a driving force behind turning it into the high-precision, data-driven science it is today.

The CMB radiation is nearly isotropic, but minute deviations, on the order of 1 part in 10^5, were first theoretically predicted and later detected. These so-called anisotropies encode information about the Universe, its past and its composition, and their detection and characterization have been the major target of CMB observations since the discovery of the CMB in 1965. Over time, progressively more sophisticated and advanced observational apparatus has been designed and deployed in search of their more subtle and telltale characteristics. These include three major CMB satellites, the American Cosmic Background Explorer (COBE) [13] and Wilkinson Microwave Anisotropy Probe (WMAP) [2] and the European Planck, as well as a few dozen ground-based and balloon-borne projects. Some of these are operating at this time,

High-Performance Matrix-Matrix Multiplications of Very Small Matrices

Masliah, Abdelfattah, Haidar, et al. 2016

The use of the general dense matrix-matrix multiplication (GEMM) is fundamental for obtaining high performance in many scientific computing applications. GEMMs for small matrices (of sizes less than 32), however, are not sufficiently optimized in existing libraries. In this paper we consider the case of many small GEMMs for a wide range of computer architectures, including multicore CPUs, ARM, Intel Xeon Phi, and GPUs. This case often occurs in applications such as big data analytics, machine learning, high-order FEM, and others. The GEMMs are grouped together in a single batched routine. We present algorithms and optimization techniques specialized for these cases that obtain performance within 90% of the optimal. For example, on a P100 GPU for square matrices of size 32, we achieve an execution rate of about 1,030 Gflop/s in double-precision arithmetic, which is 90% of the theoretically derived peak for this computation on a P100 GPU. We show that our results outperform currently available state-of-the-art implementations and vendor-tuned math libraries, including Intel MKL, Nvidia CUBLAS, and OpenBLAS.

Joel Falcou

High-performance Tensor Contractions for GPUs

Quaff: efficient C++ design for parallel skeletons

Spherical Harmonic Transform with GPUs

High-Performance Matrix-Matrix Multiplications of Very Small Matrices
