2015
DOI: 10.1016/j.cpc.2014.12.013

An efficient tensor transpose algorithm for multicore CPU, Intel Xeon Phi, and NVidia Tesla GPU

Abstract: An efficient parallel tensor transpose algorithm is suggested for shared-memory computing units, namely, multicore CPU, Intel Xeon Phi, and NVidia GPU. The algorithm operates on dense tensors (multidimensional arrays) and is based on the optimization of cache utilization on x86 CPU and the use of shared memory on NVidia GPU. From the applied side, the ultimate goal is to minimize the overhead encountered in the transformation of tensor contractions into matrix multiplications in computer implementations…
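The transpose in question is a general index permutation of a dense array. As a point of reference only, here is a minimal C++ sketch of that operation, assuming row-major storage and a hypothetical `permute` helper; it is a naive baseline and deliberately omits the cache blocking (CPU) and shared-memory tiling (GPU) that the paper's algorithm is about.

```cpp
#include <cstddef>
#include <iostream>
#include <vector>

// Naive reference for a dense tensor transpose B = permute(A, perm):
// output axis o takes its extent and index from input axis perm[o].
// Row-major layout; no cache blocking (CPU) or shared-memory tiling (GPU),
// which is what the paper's algorithm adds on top of this basic operation.
std::vector<double> permute(const std::vector<double>& in,
                            const std::vector<int>& dims,
                            const std::vector<int>& perm) {
    const int rank = static_cast<int>(dims.size());
    std::vector<int> odims(rank);
    for (int o = 0; o < rank; ++o) odims[o] = dims[perm[o]];

    std::vector<long> ostr(rank, 1);           // row-major output strides
    for (int a = rank - 2; a >= 0; --a) ostr[a] = ostr[a + 1] * odims[a + 1];

    std::vector<double> out(in.size());
    std::vector<int> idx(rank, 0);             // multi-index into the input
    for (std::size_t lin = 0; lin < in.size(); ++lin) {
        long off = 0;                          // offset of this element in the output
        for (int o = 0; o < rank; ++o) off += idx[perm[o]] * ostr[o];
        out[off] = in[lin];
        for (int a = rank - 1; a >= 0; --a) {  // advance the input multi-index
            if (++idx[a] < dims[a]) break;
            idx[a] = 0;
        }
    }
    return out;
}

int main() {
    // A(i,j,k) with extents 2 x 3 x 4, permuted to B(k,i,j).
    std::vector<double> A(2 * 3 * 4);
    for (std::size_t i = 0; i < A.size(); ++i) A[i] = static_cast<double>(i);
    std::vector<double> B = permute(A, {2, 3, 4}, {2, 0, 1});
    // B(k=1,i=0,j=2) must equal A(i=0,j=2,k=1), whose linear offset (and value) is 9.
    std::cout << B[1 * (2 * 3) + 0 * 3 + 2] << "\n";   // prints 9
    return 0;
}
```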

Cited by 50 publications (36 citation statements)
References 27 publications
“…This effort was later extended to the evaluation of the O(N^7) terms in renormalized CCSD(T) and multireference CCSD(T). More recently Lyakh developed a standalone general-purpose tensor algebra library (TAL_SH), which supports basic tensor algebra on multicore CPU, many-core, and NVIDIA GPU-containing shared-memory computers; although TAL_SH includes custom CUDA kernels for tensor operations, vendor-provided BLAS is recommended to reach high performance. Yet another effort to produce optimized device kernels for tensor algebra was described by Kim et al., who considered specifically the tensor contractions that appear in the (T) energy.…”
Section: Prior Work
confidence: 99%
“…The tensor contraction operation in TAL-SH is implemented by the general Transpose-Transpose-GEMM-Transpose (TTGT) algorithm, with an optimized GPU tensor transpose operation [57] (see also Ref. [58]).…”
Section: E. Out-of-Core Asynchronous Execution of Tensor Contractions
confidence: 99%
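For orientation, the sketch below illustrates the TTGT idea named in this excerpt on one assumed contraction, C(c,a,b) = Σ_d A(d,a,b)·B(c,d): permute the inputs so the contracted index lies on a matrix dimension, multiply, and permute the result into the requested layout. The extents, the contraction, and the naive triple-loop GEMM are illustrative assumptions, not TAL-SH code; in practice the GEMM step is handed to a vendor BLAS and the permutations use an optimized tensor-transpose kernel such as the one in the cited paper.

```cpp
#include <cstddef>
#include <iostream>
#include <vector>

// Minimal TTGT sketch for one illustrative contraction (not TAL-SH's code):
//   C(c,a,b) = sum_d A(d,a,b) * B(c,d)
// Steps: transpose both inputs so the contracted index d sits on a GEMM
// dimension, run a plain matrix multiply, then transpose the result.
int main() {
    const int nd = 3, na = 4, nb = 5, nc = 2;   // arbitrary small extents
    std::vector<double> A(nd * na * nb), B(nc * nd);
    for (std::size_t i = 0; i < A.size(); ++i) A[i] = 0.01 * i;
    for (std::size_t i = 0; i < B.size(); ++i) B[i] = 0.1 * i;

    // T1: At(a,b,d) = A(d,a,b) -- contracted index d moved to the end,
    // so At can be viewed as an (na*nb) x nd matrix.
    std::vector<double> At(na * nb * nd);
    for (int d = 0; d < nd; ++d)
        for (int a = 0; a < na; ++a)
            for (int b = 0; b < nb; ++b)
                At[(a * nb + b) * nd + d] = A[(d * na + a) * nb + b];

    // T2: Bt(d,c) = B(c,d) -- an nd x nc matrix with d leading.
    std::vector<double> Bt(nd * nc);
    for (int c = 0; c < nc; ++c)
        for (int d = 0; d < nd; ++d)
            Bt[d * nc + c] = B[c * nd + d];

    // GEMM: Tmp[(a,b), c] = At * Bt. In practice this step is a call to a
    // vendor BLAS (dgemm); a naive triple loop keeps the sketch self-contained.
    std::vector<double> Tmp(na * nb * nc, 0.0);
    for (int ab = 0; ab < na * nb; ++ab)
        for (int c = 0; c < nc; ++c)
            for (int d = 0; d < nd; ++d)
                Tmp[ab * nc + c] += At[ab * nd + d] * Bt[d * nc + c];

    // T3: C(c,a,b) = Tmp(a,b,c) -- final transpose into the requested layout.
    std::vector<double> C(nc * na * nb);
    for (int a = 0; a < na; ++a)
        for (int b = 0; b < nb; ++b)
            for (int c = 0; c < nc; ++c)
                C[(c * na + a) * nb + b] = Tmp[(a * nb + b) * nc + c];

    std::cout << "C(0,0,0) = " << C[0] << "\n";
    return 0;
}
```

The appeal of the pattern is that the arithmetic-heavy step becomes a single GEMM handled by a tuned BLAS, so the transposes are pure overhead; minimizing that overhead is exactly the goal stated in the paper's abstract.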
“…Lyakh et al. [12] designed a generic multidimensional transpose algorithm and evaluated it across different architectures (e.g., Intel Xeon, Intel Xeon Phi, AMD and NVIDIA K20X). Despite the fact that their algorithm outperforms a naive baseline implementation, the results suggest that there still exists a noticeable performance gap to the bandwidth attained by a direct copy.…”
Section: arXiv:1704.04374v2 [cs.MS] 10 May 2017
confidence: 99%
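The "direct copy" comparison mentioned in this excerpt can be reproduced in rough form on a CPU with a sketch like the one below, which times a plain copy and a naive square-matrix transpose over the same data volume and reports effective bandwidth (counting one read plus one write per element). The matrix size and the methodology are assumptions for illustration, not the benchmark of either cited paper.

```cpp
#include <algorithm>
#include <chrono>
#include <iostream>
#include <vector>

// Rough CPU sketch of a "transpose vs. direct copy" bandwidth comparison:
// both kernels move the same data volume, so the gap between the two rates
// reflects the cost of the strided access pattern that cache blocking targets.
int main() {
    const std::size_t n = 4096;                        // assumed square matrix extent
    const double bytes = 2.0 * n * n * sizeof(double); // one read + one write per element
    std::vector<double> src(n * n, 1.0), dst(n * n);

    auto c0 = std::chrono::steady_clock::now();
    std::copy(src.begin(), src.end(), dst.begin());    // direct copy
    auto c1 = std::chrono::steady_clock::now();
    std::cout << "copy check: " << dst[0] << "\n";     // keep the copy observable

    auto t0 = std::chrono::steady_clock::now();
    for (std::size_t i = 0; i < n; ++i)                // naive out-of-place transpose
        for (std::size_t j = 0; j < n; ++j)
            dst[j * n + i] = src[i * n + j];
    auto t1 = std::chrono::steady_clock::now();
    std::cout << "transpose check: " << dst.back() << "\n";

    const double copy_s = std::chrono::duration<double>(c1 - c0).count();
    const double trsp_s = std::chrono::duration<double>(t1 - t0).count();
    std::cout << "copy:      " << bytes / copy_s / 1e9 << " GB/s\n";
    std::cout << "transpose: " << bytes / trsp_s / 1e9 << " GB/s\n";
    return 0;
}
```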