Analysis and tuning of libtensor framework on multicore architectures

Ibrahim, Khaled Z.; Williams, Samuel; Epifanovsky, Evgeny; Krylov, Anna I.

doi:10.1109/hipc.2014.7116881

Cited by 10 publications

(12 citation statements)

References 12 publications

“…Our work on graph optimization builds on substantial efforts for optimization of computational graphs of tensor operations. Tensor contraction can be optimized via parallelization [22,23,41,49], efficient transposition [51], blocking [10,18,28,43], exploiting symmetry [15,48,49], and sparsity [22,24,32,39,39,47]. For complicated tensor graphs, specialized compilers like XLA [52] and TVM [8] rewrite the computational graph to optimize program execution and memory allocation on dedicated hardware.…”

Section: Previous Work (mentioning)

confidence: 99%

AutoHOOT

Solomonik

2020

Proceedings of the ACM International Conference on Parallel Architectures and Compilation Techniques

High-order optimization methods, including Newton's method and its variants as well as alternating minimization methods, dominate the optimization algorithms for tensor decompositions and tensor networks. These tensor methods are used for data analysis and simulation of quantum systems. In this work, we introduce AutoHOOT, the first automatic differentiation (AD) framework targeting high-order optimization for tensor computations. AutoHOOT takes input tensor computation expressions and generates optimized derivative expressions. In particular, AutoHOOT contains a new explicit Jacobian / Hessian expression generation kernel whose outputs maintain the input tensors' granularity and are easy to optimize. The expressions are then optimized by both the traditional compiler optimization techniques and specific tensor algebra transformations. Experimental results show that AutoHOOT achieves competitive CPU and GPU performance for both tensor decomposition and tensor network applications compared to existing AD software and other tensor computation libraries with manually written kernels. The tensor methods generated by AutoHOOT are also well-parallelizable, and we demonstrate good scalability on a distributed memory supercomputer.

“…The tasking model incurs scheduling overheads at queueing or dequeueing tasks. Distribution of task granularities typically shows a wide variation with dominance of small tasks [20]. Achieving higher concurrency also involves using smaller blocks.…”

Section: Shared Memory Task-based Backend (mentioning)

confidence: 99%

“…The same approach is used in distributed memory models using the partitioned global address space (PGAS) abstraction, such as global arrays. Even in shared memory machines with non-uniform memory access (NUMA), such a cyclic distribution is essential for improving performance [20]. A more complex indexing, such as that used in CTF, allows a regular distribution of data using a mapping function between the tensor dimension and the processes based on a virtual layout.…”

Section: Tensor Data Distribution (mentioning)

confidence: 99%

Cross-scale efficient tensor contractions for coupled cluster computations through multiple programming model backends

Ibrahim

Epifanovsky

Williams

et al. 2017

Journal of Parallel and Distributed Computing

Self Cite

Coupled-cluster methods provide highly accurate models of molecular structure through explicit numerical calculation of tensors representing the correlation between electrons. These calculations are dominated by a sequence of tensor contractions, motivating the development of numerical libraries for such operations. While based on matrix-matrix multiplication, these libraries are specialized to exploit symmetries in the molecular structure and in electronic interactions, and thus reduce the size of the tensor representation and the complexity of contractions. The resulting algorithms are irregular and their parallelization has been previously achieved via the use of dynamic scheduling or specialized data decompositions. We introduce our efforts to extend the Libtensor framework to work in the distributed memory environment in a scalable and energy-efficient manner. We achieve up to 240× speedup compared with the optimized shared memory implementation of Libtensor. We attain scalability to hundreds of thousands of compute cores on three distributed-memory architectures (Cray XC30 and XC40, and IBM Blue Gene/Q), and on a heterogeneous GPU-CPU system (Cray XK7). As the bottlenecks shift from being compute-bound DGEMM's to communication-bound collectives as the size of the molecular system scales, we adopt two radically different parallelization approaches for handling load-imbalance, tasking and bulk synchronous models. Nevertheless, we preserve a unified interface to both programming models to maintain the productivity of computational quantum chemists.
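The cyclic distribution mentioned in the citation statements for this paper can be illustrated with a small sketch. This is not taken from Libtensor or the citing paper: the function names, the block costs, and the number of memory domains are all hypothetical, chosen only to show how a round-robin mapping of blocks to owners differs from a contiguous one when block costs are uneven.

```python
def cyclic_owner(block_index, num_domains):
    """Round-robin (cyclic) mapping: block i is owned by domain i mod P."""
    return block_index % num_domains


def contiguous_owner(block_index, num_blocks, num_domains):
    """Contiguous mapping for comparison: consecutive blocks share an owner."""
    per_domain = -(-num_blocks // num_domains)  # ceiling division
    return block_index // per_domain


def load_per_domain(owners, costs, num_domains):
    """Sum the cost of the blocks assigned to each domain."""
    load = [0] * num_domains
    for owner, cost in zip(owners, costs):
        load[owner] += cost
    return load


# Hypothetical workload: 8 blocks whose cost grows with the block index,
# spread over 4 domains (e.g. NUMA nodes or processes).
num_blocks, num_domains = 8, 4
costs = [i + 1 for i in range(num_blocks)]

cyclic = [cyclic_owner(i, num_domains) for i in range(num_blocks)]
contiguous = [contiguous_owner(i, num_blocks, num_domains)
              for i in range(num_blocks)]

cyclic_load = load_per_domain(cyclic, costs, num_domains)
contiguous_load = load_per_domain(contiguous, costs, num_domains)

print("cyclic load:    ", cyclic_load)
print("contiguous load:", contiguous_load)
```

In this made-up workload the most loaded domain carries less work under the cyclic mapping than under the contiguous one; real tensor block costs and memory placement effects will of course differ.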

“…Multi-dimensional tensors with symmetry are stored as a collection of fully dense "bricks" or data-tiles, where only distinct bricks are explicitly stored. The contraction of two tensors is implemented as a collection of contractions involving the set of bricks representing the tensor [7][8][9][10]. The tile sizes are chosen based on the available memory and to ensure efficient communication, maximize computation efficiency throughout the calculation, and enable dynamic load balancing.…”

Section: Introduction (mentioning)

confidence: 99%

Optimizing Tensor Contractions in CCSD(T) for Efficient Execution on GPUs

Kim

Sukumaran-Rajam

Hong

et al. 2018

Proceedings of the 2018 International Conference on Supercomputing

Tensor contractions are higher dimensional analogs of matrix multiplications, used in many computational contexts such as high order models in quantum chemistry, deep learning, finite element methods etc. In contrast to the wide availability of high-performance libraries for matrix multiplication on GPUs, the same is not true for tensor contractions. In this paper, we address the optimization of a set of symmetrized tensor contractions that form the computational bottleneck in the CCSD(T) coupled-cluster method in computational chemistry suites like NWChem. Some of the challenges in optimizing tensor contractions that arise in practice from the variety of dimensionalities and shapes for tensors include effective mapping of the high-dimensional iteration space to threads, choice of data buffering in shared-memory and registers, and tile sizes for multi-level tiling. Furthermore, in the case of symmetrized tensor contractions in CCSD(T), it is also a challenge to fuse contractions to reduce data movement cost by exploiting reuse of intermediate tensors. In this paper, we develop an efficient GPU implementation of the tensor contractions in CCSD(T) using shared-memory buffering, register tiling, loop fusion and register transpose. Experimental results demonstrate significant improvement over the current state-of-the-art.
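The statement that tensor contractions generalize matrix multiplication can be made concrete with a minimal sketch. It is an illustration only, not the GPU implementation described in the abstract: the contraction C[a][b][c] = sum over k of A[a][b][k] * B[k][c], the helper names, and the sample data are all assumptions. The sketch computes the contraction once with explicit loops and once by flattening the (a, b) index pair into a single row index and calling a plain matrix multiply.

```python
def matmul(X, Y):
    """Plain matrix product of nested lists X (m x k) and Y (k x n)."""
    inner, cols = len(Y), len(Y[0])
    return [[sum(row[k] * Y[k][j] for k in range(inner)) for j in range(cols)]
            for row in X]


def contract_direct(A, B):
    """C[a][b][c] = sum_k A[a][b][k] * B[k][c], written as explicit loops."""
    na, nb, nk, nc = len(A), len(A[0]), len(B), len(B[0])
    return [[[sum(A[a][b][k] * B[k][c] for k in range(nk))
              for c in range(nc)]
             for b in range(nb)]
            for a in range(na)]


def contract_via_matmul(A, B):
    """Same contraction: flatten (a, b) into one row index, then multiply."""
    na, nb = len(A), len(A[0])
    flat = [A[a][b] for a in range(na) for b in range(nb)]  # (na*nb) x nk
    prod = matmul(flat, B)                                   # (na*nb) x nc
    # Unflatten the row index back into (a, b).
    return [[prod[a * nb + b] for b in range(nb)] for a in range(na)]


# Hypothetical 2 x 2 x 2 tensor contracted with a 2 x 3 matrix.
A = [[[1, 2], [3, 4]], [[5, 6], [7, 8]]]
B = [[1, 0, 2], [0, 1, 3]]

C_direct = contract_direct(A, B)
C_matmul = contract_via_matmul(A, B)
print(C_direct == C_matmul)
```

Here the contracted index is already the last index of A, so no transposition is needed before flattening; in general a contraction may first require permuting indices, which is one reason dedicated contraction kernels are studied separately from matrix multiplication.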
