Generating Efficient Tensor Contractions for GPUs

Nelson, Thomas; Rivera, Axel; Balaprakash, Prasanna; Hall, Mary; Hovland, Paul; Jessup, Elizabeth R.; Norris, Boyana

doi:10.1109/icpp.2015.106

Cited by 33 publications

(24 citation statements)

References 22 publications


“…[Einstein 1916] "We therefore introduce the following rule: when an index appears twice in a term of an expression, one shall always sum over it, unless the opposite is noted explicitly." (authors' translation) In addition to the importance of Einstein's convention for mathematical notation, it serves as the basis for elegant domain-specific languages [Åhlander 2002;Nelson et al 2015;Solomonik et al 2013]. For example, in the Cyclops tensor framework one may write [Solomonik et al 2013]:…”

Section: High-level Language and Representation (mentioning)

confidence: 99%
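The summation convention quoted above can be illustrated with a minimal sketch in plain Python. This is not the Cyclops syntax that the quotation elides, and the helper name `contract_ik_kj` is hypothetical; it only shows that an index appearing twice in a term (here `k`) is summed over, while the remaining indices (`i`, `j`) label the result.

```python
# Illustrative only: the helper below is a hypothetical name, not part of
# any framework mentioned in the citation statement.

def contract_ik_kj(A, B):
    """Return C with C[i][j] = A[i][k] * B[k][j], summing the repeated index k."""
    I, K, J = len(A), len(B), len(B[0])
    # Every row of A must have as many entries as B has rows.
    assert all(len(row) == K for row in A)
    return [[sum(A[i][k] * B[k][j] for k in range(K)) for j in range(J)]
            for i in range(I)]

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]

# Repeated index k is summed; free indices i and j remain.
C = contract_ik_kj(A, B)

# A repeated index inside a single tensor, A[i][i], gives the trace.
trace = sum(A[i][i] for i in range(len(A)))
```

Under the convention, the same contraction would be written simply as C_ij = A_ik B_kj, with the sum over `k` left implicit.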

“…Existing open-source software packages focus on binary tensor contractions [Li et al 2015;Matthews 2018;Shi et al 2016;Solomonik et al 2013;Springer and Bientinesi 2018], GPUs [Nelson et al 2015], only support tensors up to order 2 (matrices) [Spampinato et al 2018;Spampinato and Püschel 2014;Uphoff and Bader 2016], or focus on loop transformations [Kempf et al 2018;Luporini et al 2015;Stock et al 2011], where the latter lack support for sparse matrices in element-local operators and are to our understanding not designed for use with code generators for small GEMMs.…”

Section: Introduction (mentioning)

confidence: 99%


Yet Another Tensor Toolbox for Discontinuous Galerkin Methods and Other Applications

Uphoff, Bader. 2020. ACM Trans. Math. Softw.


The numerical solution of partial differential equations is at the heart of many grand challenges in supercomputing. Solvers based on high-order discontinuous Galerkin (DG) discretisation have been shown to scale on large supercomputers with excellent performance and efficiency if the implementation exploits all levels of parallelism and is tailored to the specific architecture. However, every year new supercomputers emerge and the list of hardware-specific considerations grows simultaneously with the list of desired features in a DG code. Thus, we believe that a sustainable DG code needs an abstraction layer to implement the numerical scheme in a suitable language. We explore the possibility to abstract the numerical scheme as small tensor operations, describe them in a domain-specific language (DSL) resembling the Einstein notation, and to map them to small General Matrix-Matrix Multiplication routines. The compiler for our DSL implements classic optimisations that are used for large tensor contractions, and we present novel optimisation techniques such as equivalent sparsity patterns and optimal index permutations for temporary tensors. Our application examples, which include the earthquake simulation software SeisSol, show that the generated kernels achieve over 50% peak performance of a recent 48-core Skylake system while the DSL considerably simplifies the implementation. CCS Concepts: • Computing methodologies → Massively parallel and high-performance simulations; • Software and its engineering → Source code generation; Domain specific languages; • Applied computing → Earth and atmospheric sciences;


“…Recently, GPUs have been increasingly adopted to accelerate diverse tensor computations. Some works focused on accelerating specific tensor operations including tensor contraction [25,26], factorization [27], transpose [28,29], and tensor-matrix multiplication [30]. These works propose parallel tensor algorithms specifically optimized for the GPU architectures.…”

Section: Related Work (mentioning)

confidence: 99%

Efficient Tensor Sensing for RF Tomographic Imaging on GPUs

Zhang. 2019. Future Internet


Radio-frequency (RF) tomographic imaging is a promising technique for inferring multi-dimensional physical space by processing RF signals traversed across a region of interest. Tensor-based approaches for tomographic imaging are superior at detecting the objects within higher dimensional spaces. The recently-proposed tensor sensing approach based on the transform tensor model achieves a lower error rate and faster speed than the previous tensor-based compress sensing approach. However, the running time of the tensor sensing approach increases exponentially with the dimension of tensors, thus not being very practical for big tensors. In this paper, we address this problem by exploiting massively-parallel GPUs. We design, implement, and optimize the tensor sensing approach on an NVIDIA Tesla GPU and evaluate the performance in terms of the running time and recovery error rate. Experimental results show that our GPU tensor sensing is as accurate as the CPU counterpart with an average of 44.79× and up to 84.70× speedups for varying-sized synthetic tensor data. For IKEA Model 3D model data of a smaller size, our GPU algorithm achieved 15.374× speedup over the CPU tensor sensing. We further encapsulate the GPU algorithm into an open-source library, called cuTensorSensing (CUDA Tensor Sensing), which can be used for efficient RF tomographic imaging.


“…However, they focus on optimizing only limited number of tensor contraction kernels on extreme small size tensors. Other works in [1] [20] improve the tensor computation performance by doing loop reorganization and fusion.…”

Section: Introduction and Scope (mentioning)

confidence: 99%

Tensor Contractions with Extended BLAS Kernels on CPU and GPU

Shi, Niranjan, Anandkumar, et al. 2016. 2016 IEEE 23rd International Conference on High Performance Computing (HiPC)


Tensor contractions constitute a key computational ingredient of numerical multi-linear algebra. However, as the order and dimension of tensors grow, the time and space complexities of tensor-based computations grow quickly. In this paper, we propose and evaluate new BLAS-like primitives that are capable of performing a wide range of tensor contractions on CPU and GPU efficiently. We begin by focusing on single-index contractions involving all the possible configurations of second-order and third-order tensors. Then, we discuss extensions to more general cases. Existing approaches for tensor contractions spend large amounts of time restructuring the data which typically involves explicit copy and transpose operations. In this work, we summarize existing approaches and present library-based approaches that avoid memory movement. Through systematic benchmarking, we demonstrate that our approach can achieve 10x speedup on a K40c GPU and 2x speedup on dual-socket Haswell-EP CPUs, using MKL and CUBLAS respectively, for small and moderate tensor sizes. This is relevant in many machine learning applications such as deep learning, where tensor sizes tend to be small, but require numerous tensor contraction operations to be performed successively. Concretely, we implement a Tucker decomposition and show that using our kernels yields at least an order of magnitude speedup as compared to state-of-the-art libraries.
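To make the idea of a single-index contraction without explicit copies or transposes concrete, here is a minimal sketch in plain Python. It is not the primitive proposed in the abstract and uses no MKL or CUBLAS call; the function names `gemm_flat` and `contract_reference` are hypothetical, and a simple loop stands in for a library GEMM. It assumes row-major flat storage and covers only one index configuration, C[m,n,p] = sum_k A[m,k] B[k,n,p], where the third-order tensor B can be read as a K x (N*P) matrix with no data movement.

```python
def gemm_flat(A, B, M, K, N):
    """Row-major C (M x N) = A (M x K) times B (K x N), all stored as flat lists."""
    C = [0] * (M * N)
    for m in range(M):
        for k in range(K):
            a = A[m * K + k]
            for n in range(N):
                C[m * N + n] += a * B[k * N + n]
    return C

def contract_reference(A, B, M, K, N, P):
    """Direct loop nest for C[m,n,p] = sum_k A[m,k] * B[k,n,p] (row-major)."""
    C = [0] * (M * N * P)
    for m in range(M):
        for n in range(N):
            for p in range(P):
                C[(m * N + n) * P + p] = sum(
                    A[m * K + k] * B[(k * N + n) * P + p] for k in range(K)
                )
    return C

M, K, N, P = 2, 3, 2, 2
A = list(range(1, M * K + 1))       # A has shape (M, K)
B = list(range(1, K * N * P + 1))   # B has shape (K, N, P)

# Reinterpret B as a K x (N*P) matrix: same buffer, no copy, no transpose.
C_gemm = gemm_flat(A, B, M, K, N * P)

# The flat result is already C with shape (M, N, P) in row-major order.
C_ref = contract_reference(A, B, M, K, N, P)
```

Other index configurations do not flatten this conveniently, which is presumably where the restructuring cost described in the abstract arises; the sketch does not attempt to cover those cases.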
