In this article we present a systematic approach to the derivation of families of high-performance algorithms for a large set of frequently encountered dense linear algebra operations. As part of the derivation, a constructive proof of the correctness of the algorithm is generated. The article is structured so that it can be used as a tutorial for novices. However, the method has been shown to yield new high-performance algorithms for well-studied linear algebra operations, and it should therefore also be of interest to those who wish to produce best-in-class high-performance codes.
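To make the target of such derivations concrete, here is a minimal sketch of the kind of blocked, loop-based algorithm the approach produces; the choice of Cholesky factorization as the example operation, the partitioning, and the block size are our own illustrative assumptions, not taken from the article.

```python
import numpy as np

def blocked_cholesky(A, nb=64):
    """Overwrite A (in its lower triangle) with the Cholesky factor L, A = L L^T."""
    n = A.shape[0]
    for k in range(0, n, nb):
        b = min(nb, n - k)
        # Factor the current diagonal block: A11 = L11 L11^T
        A[k:k+b, k:k+b] = np.linalg.cholesky(A[k:k+b, k:k+b])
        L11 = A[k:k+b, k:k+b]
        if k + b < n:
            # Triangular solve for the panel below the block: A21 := A21 L11^{-T}
            A[k+b:, k:k+b] = np.linalg.solve(L11, A[k+b:, k:k+b].T).T
            L21 = A[k+b:, k:k+b]
            # Symmetric rank-b update of the trailing matrix: A22 := A22 - L21 L21^T
            A[k+b:, k+b:] -= L21 @ L21.T
    return np.tril(A)

# Usage: factor a random SPD matrix and check the residual.
rng = np.random.default_rng(0)
B = rng.standard_normal((300, 300))
A = B @ B.T + 300.0 * np.eye(300)
L = blocked_cholesky(A.copy())
print(np.linalg.norm(L @ L.T - A) / np.linalg.norm(A))
```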
We present "GEMM-like Tensor-Tensor multiplication" (GETT), a novel approach for dense tensor contractions that mirrors the design of a high-performance general matrix-matrix multiplication (GEMM). The critical insight behind GETT is the identification of three index sets, involved in the tensor contraction, which enable us to systematically reduce an arbitrary tensor contraction to loops around a highly tuned "macro-kernel". This macro-kernel operates on suitably prepared ("packed") sub-tensors that reside in a specified level of the cache hierarchy. In contrast to previous approaches to tensor contractions, GETT exhibits desirable features such as unit-stride memory accesses, cache-awareness, as well as full vectorization, without requiring auxiliary memory. We integrate GETT alongside the so called TransposeTranspose-GEMM-Transpose and Loops-over-GEMM approaches into an open source "Tensor Contraction Code Generator" (TCCG). The performance results for a wide range of tensor contractions suggest that GETT has the potential of becoming the method of choice: While GETT exhibits excellent performance across the board, its effectiveness for bandwidth-bound tensor contractions is especially impressive, outperforming existing approaches by up to 12.4×. More precisely, GETT achieves speedups of up to 1.41× over an equivalent-sized GEMM for bandwidth-bound tensor contractions while attaining up to 91.3% of peak floating-point performance for compute-bound tensor contractions.
We analyze the FEAST method for computing selected eigenvalues and eigenvectors of large sparse matrix pencils. After establishing the close connection between FEAST and the well-known Rayleigh-Ritz method, we identify several critical issues that influence the convergence and accuracy of the solver: the choice of the starting vector space, the stopping criterion, the impact of the inner linear-system solves on the quality of the solution, and the use of FEAST for computing eigenpairs from multiple intervals. We complement the study with numerical examples and hint at possible improvements to overcome the existing problems.
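For orientation, the sketch below performs a FEAST-style filtered subspace iteration followed by a Rayleigh-Ritz step, restricted to a standard symmetric eigenproblem (B = I) rather than a general pencil; the function name, quadrature rule, subspace size, test matrix, and tolerance are illustrative assumptions, not the configurations studied in the paper.

```python
import numpy as np
from scipy.linalg import eigh, solve

def feast_sketch(A, lo, hi, m0=12, n_quad=8, n_iter=15, tol=1e-9):
    """Single-interval FEAST-style iteration for a real symmetric matrix A."""
    n = A.shape[0]
    c, r = 0.5 * (lo + hi), 0.5 * (hi - lo)
    Y = np.random.rand(n, m0)                      # starting vector space
    for _ in range(n_iter):
        # Spectral projector approximated by quadrature on a circle enclosing
        # (lo, hi):  Q ~ (2*pi*i)^-1 * contour integral of (z I - A)^-1 Y dz
        Q = np.zeros((n, m0))
        for j in range(n_quad):
            theta = 2.0 * np.pi * (j + 0.5) / n_quad
            z = c + r * np.exp(1j * theta)
            W = solve(z * np.eye(n) - A, Y)
            Q += np.real((r * np.exp(1j * theta) / n_quad) * W)
        # Rayleigh-Ritz step on the filtered subspace
        Q, _ = np.linalg.qr(Q)
        lam, V = eigh(Q.T @ A @ Q)
        X = Q @ V
        inside = (lam > lo) & (lam < hi)
        res = np.linalg.norm(A @ X - X * lam, axis=0)
        if inside.any() and res[inside].max() < tol:   # stopping criterion
            break
        Y = X                                          # refined subspace
    return lam[inside], X[:, inside]

# Usage: a matrix with known spectrum; (2.0, 2.4) contains 8 eigenvalues.
rng = np.random.default_rng(0)
Qo, _ = np.linalg.qr(rng.standard_normal((200, 200)))
A = Qo @ np.diag(np.linspace(0.0, 10.0, 200)) @ Qo.T
lam, X = feast_sketch(A, 2.0, 2.4)
print(np.sort(lam))
```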
We present a new parallel algorithm for the dense symmetric eigenvalue/eigenvector problem that is based upon the tridiagonal eigensolver, Algorithm MR³, recently developed by Dhillon and Parlett. Algorithm MR³ has a complexity of O(n²) operations for computing all eigenvalues and eigenvectors of a symmetric tridiagonal problem. Moreover, the algorithm requires only O(n) extra workspace and can be adapted to compute any subset of k eigenpairs in O(nk) time. In contrast, all earlier stable parallel algorithms for the tridiagonal eigenproblem require O(n³) operations in the worst case, while some implementations, such as divide and conquer, have an extra O(n²) memory requirement. The proposed parallel algorithm balances the workload equally among the processors by traversing a matrix-dependent representation tree which captures the sequence of computations performed by Algorithm MR³. The resulting implementation allows problems of very large size to be solved efficiently: the largest dense eigenproblem solved in-core on a 256-processor machine with 2 GB of memory per processor is for a matrix of size 128,000 × 128,000, which required about 8 hours of CPU time. We present comparisons with other eigensolvers and results on matrices that arise in the applications of computational quantum chemistry and finite element modeling of automobile bodies.
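The subset capability that the parallel algorithm builds on can be demonstrated with the sequential MRRR kernel exposed by LAPACK's xSTEMR; the sketch below computes a few eigenpairs of a tridiagonal test matrix and is only an illustration, not the parallel dense solver described in the paper.

```python
import numpy as np
from scipy.linalg import eigh_tridiagonal

n, k = 5000, 20
d = 2.0 * np.ones(n)          # diagonal of the 1D discrete Laplacian
e = -1.0 * np.ones(n - 1)     # off-diagonal

# Ask the MRRR-based driver (xSTEMR) for the k smallest eigenpairs only,
# i.e. eigenvalue indices 0 .. k-1, rather than the full spectrum.
lam, V = eigh_tridiagonal(d, e, select='i', select_range=(0, k - 1),
                          lapack_driver='stemr')

# Known eigenvalues of this matrix: 2 - 2*cos(j*pi/(n+1)), j = 1..n
exact = 2.0 - 2.0 * np.cos(np.arange(1, k + 1) * np.pi / (n + 1))
print(np.max(np.abs(lam - exact)))          # eigenvalue error
print(np.linalg.norm(V.T @ V - np.eye(k)))  # orthogonality of computed vectors
```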