We present "GEMM-like Tensor-Tensor multiplication" (GETT), a novel approach for dense tensor contractions that mirrors the design of a high-performance general matrix-matrix multiplication (GEMM). The critical insight behind GETT is the identification of three index sets involved in the tensor contraction, which enable us to systematically reduce an arbitrary tensor contraction to loops around a highly tuned "macro-kernel". This macro-kernel operates on suitably prepared ("packed") sub-tensors that reside in a specified level of the cache hierarchy. In contrast to previous approaches to tensor contractions, GETT exhibits desirable features such as unit-stride memory accesses, cache-awareness, and full vectorization, without requiring auxiliary memory. We integrate GETT alongside the so-called Transpose-Transpose-GEMM-Transpose and Loops-over-GEMM approaches into an open-source "Tensor Contraction Code Generator" (TCCG). The performance results for a wide range of tensor contractions suggest that GETT has the potential to become the method of choice: while GETT exhibits excellent performance across the board, its effectiveness for bandwidth-bound tensor contractions is especially impressive, outperforming existing approaches by up to 12.4×. More precisely, GETT achieves speedups of up to 1.41× over an equivalent-sized GEMM for bandwidth-bound tensor contractions while attaining up to 91.3% of peak floating-point performance for compute-bound tensor contractions.
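To make the baseline concrete, the following is a minimal sketch of the Transpose-Transpose-GEMM-Transpose (TTGT) idea that the abstract contrasts with GETT; it is not GETT itself and not code from TCCG. The contraction C[a][n][b] = sum_k A[a][b][k] * B[k][n], the nested-list layout, and all function names are assumptions chosen for illustration.

```python
from itertools import product

def gemm(A, B):
    """Plain matrix product of nested lists: (M x K) times (K x N)."""
    M, K, N = len(A), len(B), len(B[0])
    return [[sum(A[i][k] * B[k][j] for k in range(K)) for j in range(N)]
            for i in range(M)]

def contract_ttgt(A, B):
    """C[a][n][b] = sum_k A[a][b][k] * B[k][n] via flatten, GEMM, permute."""
    da, db = len(A), len(A[0])
    N = len(B[0])
    # Step 1: flatten the free indices (a, b) of A into one row index.
    # A_mat is auxiliary storage of shape (da*db) x K.
    A_mat = [A[a][b] for a in range(da) for b in range(db)]
    # Step 2: a single matrix-matrix multiplication does all the arithmetic.
    C_mat = gemm(A_mat, B)  # shape (da*db) x N
    # Step 3: unflatten and permute (a, b, n) -> (a, n, b).
    return [[[C_mat[a * db + b][n] for b in range(db)]
             for n in range(N)] for a in range(da)]

def contract_reference(A, B):
    """Direct nested-loop evaluation of the same contraction."""
    da, db, K, N = len(A), len(A[0]), len(B), len(B[0])
    C = [[[0] * db for _ in range(N)] for _ in range(da)]
    for a, n, b, k in product(range(da), range(N), range(db), range(K)):
        C[a][n][b] += A[a][b][k] * B[k][n]
    return C

A = [[[1, 2], [3, 4]], [[5, 6], [7, 8]]]   # shape 2 x 2 x 2
B = [[1, 0, 2], [0, 1, 3]]                 # shape 2 x 3
C = contract_ttgt(A, B)
```

Note that the sketch allocates the intermediate matrices `A_mat` and `C_mat`; this is the kind of auxiliary memory that, according to the abstract, GETT avoids.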
Recently we presented TTC, a domain-specific compiler for tensor transpositions. Although the generated code performs nearly optimally, TTC's offline nature means it cannot be used in application codes in which the tensor sizes and the necessary tensor permutations are determined only at runtime. To overcome this limitation, we introduce the open-source C++ library High-Performance Tensor Transposition (HPTT). Similar to TTC, HPTT incorporates optimizations such as blocking, multithreading, and explicit vectorization; furthermore, it decomposes any transposition into multiple loops around a so-called micro-kernel. This modular design, inspired by BLIS, makes HPTT easy to port to different architectures by replacing only the hand-vectorized micro-kernel (e.g., a 4 × 4 transpose). HPTT also offers an optional autotuning framework, guided by performance heuristics, that explores a vast search space of implementations at runtime (similar to FFTW). Across a wide range of different tensor transpositions and architectures (e.g., Intel Ivy Bridge, ARMv7, IBM Power7), HPTT attains a bandwidth comparable to that of SAXPY, and yields remarkable speedups over Eigen's tensor transposition implementation. Most importantly, the integration of HPTT into the Cyclops Tensor Framework (CTF) improves the overall performance of tensor contractions by up to 3.1×.
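The "loops around a micro-kernel" structure described above can be illustrated with a small sketch for the two-dimensional case. This is not HPTT's API or implementation: the function names, the row-major flat-list layout, and the scalar 4 × 4 tile loop (standing in for a hand-vectorized kernel) are all assumptions made for illustration.

```python
def micro_kernel(src, dst, rows, cols, i0, j0, bi, bj):
    """Transpose one bi x bj tile at (i0, j0) of a row-major rows x cols matrix.

    A tuned library would replace this scalar loop with a vectorized kernel;
    only this function would need porting to a new architecture.
    """
    for i in range(i0, i0 + bi):
        for j in range(j0, j0 + bj):
            dst[j * rows + i] = src[i * cols + j]

def blocked_transpose(src, rows, cols, block=4):
    """Return the transpose (row-major cols x rows) using tile-wise blocking."""
    dst = [0] * (rows * cols)
    # Outer loops walk over tiles; edge tiles may be smaller than block x block.
    for i0 in range(0, rows, block):
        for j0 in range(0, cols, block):
            bi = min(block, rows - i0)
            bj = min(block, cols - j0)
            micro_kernel(src, dst, rows, cols, i0, j0, bi, bj)
    return dst

rows, cols = 5, 6
src = list(range(rows * cols))
out = blocked_transpose(src, rows, cols)
```

Blocking keeps both the reads from `src` and the writes to `dst` within a small tile at a time, which is the usual motivation for this loop structure in cache-based memory hierarchies.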
Abstract. The QCD phase diagram at finite temperature and density has attracted considerable interest over many decades now, not least because of its relevance for a better understanding of heavy-ion collision experiments. Models provide some insight into the QCD phase structure but usually rely on various parameters. Based on renormalization group arguments, we discuss how the parameters of QCD low-energy models can be determined from the fundamental theory of the strong interaction. We particularly focus on a determination of the temperature dependence of these parameters in this work and comment on the effect of a finite quark chemical potential. We present first results and argue that our findings can be used to improve the predictive power of future model calculations.
We have extended the multilevel summation (MLS) method, originally developed to evaluate long-range Coulombic interactions in molecular dynamics simulations [R. D. Skeel, I. Tezcan, and D. J. Hardy, J. Comput. Chem. 23, 673 (2002)], to handle dispersion interactions. While dispersion potentials are formally short-ranged, accurate calculation of forces and energies in interfacial and inhomogeneous systems requires long-range methods. The MLS method offers some significant advantages compared to the particle-particle particle-mesh and smooth particle mesh Ewald methods. Unlike mesh-based Ewald methods, MLS does not use fast Fourier transforms and is thus not limited by communication and bandwidth concerns. In addition, it scales linearly in the number of particles, as compared with the O(N log N) complexity of the mesh-based Ewald methods. While the structure of the MLS method is invariant for different potentials, every algorithmic step had to be adapted to accommodate the r^{-6} form of the dispersion interactions. In addition, we have derived error bounds, similar to those obtained by Hardy ["Multilevel summation for the fast evaluation of forces for the simulation of biomolecules," Ph.D. thesis, University of Illinois at Urbana-Champaign, 2006] for the electrostatic MLS. Using a prototype implementation, we have demonstrated the linear scaling of the MLS method for dispersion, and present results establishing the accuracy and efficiency of the method.