2015
DOI: 10.1016/j.cpc.2014.12.013

An efficient tensor transpose algorithm for multicore CPU, Intel Xeon Phi, and NVidia Tesla GPU

Abstract: An efficient parallel tensor transpose algorithm is suggested for shared-memory computing units, namely, multicore CPU, Intel Xeon Phi, and NVidia GPU. The algorithm operates on dense tensors (multidimensional arrays) and is based on the optimization of cache utilization on x86 CPU and the use of shared memory on NVidia GPU. From the applied side, the ultimate goal is to minimize the overhead encountered in the transformation of tensor contractions into matrix multiplications in computer implementations…
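The transpose in question is a general index permutation of a dense array. As a point of reference only, here is a minimal C++ sketch of that operation, assuming row-major storage and a hypothetical `permute` helper; it is a naive baseline and deliberately omits the cache blocking (CPU) and shared-memory tiling (GPU) that the paper's algorithm is about.

```cpp
#include <cstddef>
#include <iostream>
#include <vector>

// Naive reference for a dense tensor transpose B = permute(A, perm):
// output axis o takes its extent and index from input axis perm[o].
// Row-major layout; no cache blocking (CPU) or shared-memory tiling (GPU),
// which is what the paper's algorithm adds on top of this basic operation.
std::vector<double> permute(const std::vector<double>& in,
                            const std::vector<int>& dims,
                            const std::vector<int>& perm) {
    const int rank = static_cast<int>(dims.size());
    std::vector<int> odims(rank);
    for (int o = 0; o < rank; ++o) odims[o] = dims[perm[o]];

    std::vector<long> ostr(rank, 1);           // row-major output strides
    for (int a = rank - 2; a >= 0; --a) ostr[a] = ostr[a + 1] * odims[a + 1];

    std::vector<double> out(in.size());
    std::vector<int> idx(rank, 0);             // multi-index into the input
    for (std::size_t lin = 0; lin < in.size(); ++lin) {
        long off = 0;                          // offset of this element in the output
        for (int o = 0; o < rank; ++o) off += idx[perm[o]] * ostr[o];
        out[off] = in[lin];
        for (int a = rank - 1; a >= 0; --a) {  // advance the input multi-index
            if (++idx[a] < dims[a]) break;
            idx[a] = 0;
        }
    }
    return out;
}

int main() {
    // A(i,j,k) with extents 2 x 3 x 4, permuted to B(k,i,j).
    std::vector<double> A(2 * 3 * 4);
    for (std::size_t i = 0; i < A.size(); ++i) A[i] = static_cast<double>(i);
    std::vector<double> B = permute(A, {2, 3, 4}, {2, 0, 1});
    // B(k=1,i=0,j=2) must equal A(i=0,j=2,k=1), whose linear offset (and value) is 9.
    std::cout << B[1 * (2 * 3) + 0 * 3 + 2] << "\n";   // prints 9
    return 0;
}
```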

Cited by 50 publications (36 citation statements)
References 27 publications
“…This effort was later extended to the evaluation of the O(N^7) terms in renormalized CCSD(T) and multireference CCSD(T). More recently Lyakh developed a standalone general-purpose tensor algebra library (TAL_SH), which supports basic tensor algebra on multicore CPU, many-core, and NVIDIA GPU-containing shared-memory computers; although TAL_SH includes custom CUDA kernels for tensor operations, vendor-provided BLAS is recommended to reach high performance. Yet another effort to produce optimized device kernels for tensor algebra was described by Kim et al., who considered specifically the tensor contractions that appear in the (T) energy.…”
Section: Prior Work
confidence: 99%
“…The tensor contraction operation in TAL-SH is implemented by the general Transpose-Transpose-GEMM-Transpose (TTGT) algorithm, with an optimized GPU tensor transpose operation [57] (see also Ref. [58]).…”
Section: E. Out-of-Core Asynchronous Execution of Tensor Contractions
confidence: 99%
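For orientation, the sketch below illustrates the TTGT idea named in this excerpt on one assumed contraction, C(c,a,b) = Σ_d A(d,a,b)·B(c,d): permute the inputs so the contracted index lies on a matrix dimension, multiply, and permute the result into the requested layout. The extents, the contraction, and the naive triple-loop GEMM are illustrative assumptions, not TAL-SH code; in practice the GEMM step is handed to a vendor BLAS and the permutations use an optimized tensor-transpose kernel such as the one in the cited paper.

```cpp
#include <cstddef>
#include <iostream>
#include <vector>

// Minimal TTGT sketch for one illustrative contraction (not TAL-SH's code):
//   C(c,a,b) = sum_d A(d,a,b) * B(c,d)
// Steps: transpose both inputs so the contracted index d sits on a GEMM
// dimension, run a plain matrix multiply, then transpose the result.
int main() {
    const int nd = 3, na = 4, nb = 5, nc = 2;   // arbitrary small extents
    std::vector<double> A(nd * na * nb), B(nc * nd);
    for (std::size_t i = 0; i < A.size(); ++i) A[i] = 0.01 * i;
    for (std::size_t i = 0; i < B.size(); ++i) B[i] = 0.1 * i;

    // T1: At(a,b,d) = A(d,a,b) -- contracted index d moved to the end,
    // so At can be viewed as an (na*nb) x nd matrix.
    std::vector<double> At(na * nb * nd);
    for (int d = 0; d < nd; ++d)
        for (int a = 0; a < na; ++a)
            for (int b = 0; b < nb; ++b)
                At[(a * nb + b) * nd + d] = A[(d * na + a) * nb + b];

    // T2: Bt(d,c) = B(c,d) -- an nd x nc matrix with d leading.
    std::vector<double> Bt(nd * nc);
    for (int c = 0; c < nc; ++c)
        for (int d = 0; d < nd; ++d)
            Bt[d * nc + c] = B[c * nd + d];

    // GEMM: Tmp[(a,b), c] = At * Bt. In practice this step is a call to a
    // vendor BLAS (dgemm); a naive triple loop keeps the sketch self-contained.
    std::vector<double> Tmp(na * nb * nc, 0.0);
    for (int ab = 0; ab < na * nb; ++ab)
        for (int c = 0; c < nc; ++c)
            for (int d = 0; d < nd; ++d)
                Tmp[ab * nc + c] += At[ab * nd + d] * Bt[d * nc + c];

    // T3: C(c,a,b) = Tmp(a,b,c) -- final transpose into the requested layout.
    std::vector<double> C(nc * na * nb);
    for (int a = 0; a < na; ++a)
        for (int b = 0; b < nb; ++b)
            for (int c = 0; c < nc; ++c)
                C[(c * na + a) * nb + b] = Tmp[(a * nb + b) * nc + c];

    std::cout << "C(0,0,0) = " << C[0] << "\n";
    return 0;
}
```

The appeal of the pattern is that the arithmetic-heavy step becomes a single GEMM handled by a tuned BLAS, so the transposes are pure overhead; minimizing that overhead is exactly the goal stated in the paper's abstract.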
“…Lyakh et al. [12] designed a generic multidimensional transpose algorithm and evaluated it across different architectures (e.g., Intel Xeon, Intel Xeon Phi, AMD and NVIDIA K20X). Despite the fact that their algorithm outperforms a naive baseline implementation, the results suggest that there still exists a noticeable performance gap to the bandwidth attained by a direct copy.…”
Section: arXiv:1704.04374v2 [cs.MS] 10 May 2017
confidence: 99%
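The "direct copy" comparison mentioned in this excerpt can be reproduced in rough form on a CPU with a sketch like the one below, which times a plain copy and a naive square-matrix transpose over the same data volume and reports effective bandwidth (counting one read plus one write per element). The matrix size and the methodology are assumptions for illustration, not the benchmark of either cited paper.

```cpp
#include <algorithm>
#include <chrono>
#include <iostream>
#include <vector>

// Rough CPU sketch of a "transpose vs. direct copy" bandwidth comparison:
// both kernels move the same data volume, so the gap between the two rates
// reflects the cost of the strided access pattern that cache blocking targets.
int main() {
    const std::size_t n = 4096;                        // assumed square matrix extent
    const double bytes = 2.0 * n * n * sizeof(double); // one read + one write per element
    std::vector<double> src(n * n, 1.0), dst(n * n);

    auto c0 = std::chrono::steady_clock::now();
    std::copy(src.begin(), src.end(), dst.begin());    // direct copy
    auto c1 = std::chrono::steady_clock::now();
    std::cout << "copy check: " << dst[0] << "\n";     // keep the copy observable

    auto t0 = std::chrono::steady_clock::now();
    for (std::size_t i = 0; i < n; ++i)                // naive out-of-place transpose
        for (std::size_t j = 0; j < n; ++j)
            dst[j * n + i] = src[i * n + j];
    auto t1 = std::chrono::steady_clock::now();
    std::cout << "transpose check: " << dst.back() << "\n";

    const double copy_s = std::chrono::duration<double>(c1 - c0).count();
    const double trsp_s = std::chrono::duration<double>(t1 - t0).count();
    std::cout << "copy:      " << bytes / copy_s / 1e9 << " GB/s\n";
    std::cout << "transpose: " << bytes / trsp_s / 1e9 << " GB/s\n";
    return 0;
}
```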