Distributed-Memory Hierarchical Compression of Dense SPD Matrices

Yu, Chih‐Wei; Reiz, Severin; Biros, George

doi:10.1109/sc.2018.00018

Cited by 14 publications

(11 citation statements)

References 21 publications

“…Strong scaling: In Figure 2 (#1 through #12), we use a $2^{24}$-by-$2^{24}$ Gaussian kernel matrix generated with a synthetic 6-D point dataset. Using this matrix, we perform strong scaling experiments using up to 6,144 Skylake cores (128 compute nodes, using one MPI process per node and 48 OpenMP threads).…”

Section: Results (mentioning)

confidence: 99%

“…Our method is based on GOFMM [23], [24], which constructs a hierarchical matrix factorization for $K$, using only entries in $K$. (We review GOFMM in §II). In particular, GOFMM constructs a matrix $\tilde{K}$ (hereby "compresses") using $O(N \log N)$ entries from $K$ such that $\|K - \tilde{K}\| \le \epsilon \|K\|$ (for a user-defined tolerance $\epsilon > 0$) and a matvec with $\tilde{K}$ requires as low as $O(N)$ work.…”

Section: Introduction (mentioning)

confidence: 99%
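
The tolerance criterion quoted above can be illustrated on a small problem. The sketch below is not GOFMM: it swaps in a plain pivoted Cholesky factorization, which also reads only selected entries of an SPD matrix, and checks the relative error in the Frobenius norm. The norm choice, the bandwidth `h`, the point count, and all function names are assumptions made for illustration.

```python
import math
import random

def gaussian_kernel_matrix(points, h):
    """Dense SPD matrix with K[i][j] = exp(-||x_i - x_j||^2 / (2 h^2))."""
    n = len(points)
    K = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            d2 = sum((a - b) ** 2 for a, b in zip(points[i], points[j]))
            K[i][j] = math.exp(-d2 / (2.0 * h * h))
    return K

def frobenius_norm(A):
    return math.sqrt(sum(v * v for row in A for v in row))

def pivoted_cholesky(K, eps):
    """Return columns l_k such that K is approximated by sum_k l_k l_k^T.

    The residual of a pivoted Cholesky factorization stays positive
    semidefinite, so its Frobenius norm is bounded by its trace. Stopping
    once the residual trace drops below eps * ||K||_F therefore gives
    ||K - K_approx||_F <= eps * ||K||_F.
    """
    n = len(K)
    diag = [K[i][i] for i in range(n)]      # diagonal of the residual
    target = eps * frobenius_norm(K)
    cols = []
    while sum(diag) > target and len(cols) < n:
        p = max(range(n), key=lambda i: diag[i])   # largest residual pivot
        # Column p of the residual, built from one column of K.
        col = [K[i][p] - sum(l[i] * l[p] for l in cols) for i in range(n)]
        s = math.sqrt(diag[p])
        new_col = [c / s for c in col]
        cols.append(new_col)
        for i in range(n):
            diag[i] -= new_col[i] * new_col[i]
    return cols

# Small synthetic 6-D point set (the quoted experiments use 2^24 points).
rng = random.Random(0)
points = [[rng.random() for _ in range(6)] for _ in range(40)]
K = gaussian_kernel_matrix(points, h=2.0)   # h is an assumed bandwidth
eps = 1e-2
cols = pivoted_cholesky(K, eps)

n = len(K)
K_approx = [[sum(l[i] * l[j] for l in cols) for j in range(n)] for i in range(n)]
diff = [[K[i][j] - K_approx[i][j] for j in range(n)] for i in range(n)]
rel_err = frobenius_norm(diff) / frobenius_norm(K)
print(len(cols), rel_err)
```

A global low-rank factor like this only works when the whole matrix is numerically low rank; hierarchical formats such as HSS or FMM apply the same idea to off-diagonal blocks instead.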

“…Contributions: Based on GOFMM [24] (which we summarize in §II-A), we introduce an approximate factorization for dense matrices. As we mentioned, GOFMM supports both FMM and HSS approximations of K. The FMM, however, is much harder to factorize.…”

Section: Introduction (mentioning)

confidence: 99%

“…Unlike GOFMM, STRUMPACK does not permute K; such permutation is critical in order to improve the "approximability" of K by a HSS or FMM matrix. In [23], [24] we compared GOFMM with STRUMPACK and in most cases GOFMM was significantly faster. However, STRUMPACK also supports sparse matrices (with appropriate reordering), whereas GOFMM does not.…”

Section: Introduction (mentioning)

confidence: 99%

Distributed O(N) Linear Solver for Dense Symmetric Hierarchical Semi-Separable Matrices

Reiz; Biros

2019

2019 IEEE 13th International Symposium on Embedded Multicore/Many-Core Systems-on-Chip (MCSoC)

Self Cite

“…Xia [31] used a similar strategy for compressing blocks in a nested dissection solver, and the STRUMPACK package [27,14] also uses this randomized sampling strategy for compression of fronts into HSS form. Shared and distributed memory hierarchical compression of SPD matrices into a weak admissibility format are described in [33,34].…”

mentioning

confidence: 99%

Randomized GPU Algorithms for the Construction of Hierarchical Matrices from Matrix-Vector Operations

Boukaram; Turkiyyah; Keyes

2019

SIAM J. Sci. Comput.

Randomized algorithms for the generation of low rank approximations of large dense matrices have become popular methods in scientific computing and machine learning. In this paper, we extend the scope of these methods and present batched GPU randomized algorithms for the efficient generation of low rank representations of large sets of small dense matrices, as well as their generalization to the construction of hierarchically low rank symmetric $\mathcal{H}^2$ matrices with general partitioning structures. In both cases, the algorithms need to access the matrices only through matrix-vector multiplication operations which can be done in blocks to increase the arithmetic intensity and substantially boost the resulting performance. The batched GPU kernels are adaptive, allow nonuniform sizes in the matrices of the batch, and are more effective than SVD factorizations on matrices with fast decaying spectra. The hierarchical matrix generation consists of two phases, interleaved at every level of the matrix hierarchy. A first phase adaptively generates low rank approximations of matrix blocks through randomized matrix-vector sampling. A second phase accumulates and compresses these blocks into a hierarchical matrix that is incrementally constructed. The accumulation expresses the low rank blocks of a given level as a set of local low rank updates that are performed simultaneously on the whole matrix allowing high-performance batched kernels to be used in the compression operations. When the ranks of the blocks generated in the first phase are too large to be processed in a single operation, the low rank updates can be split into smaller-sized updates and applied in sequence. Assuming representative rank k, the resulting matrix has optimal O(kN) asymptotic storage complexity because of the nested bases it uses. The ability to generate an $\mathcal{H}^2$ matrix from matrix-vector products allows us to support a general randomized matrix-matrix multiplication operation, an important kernel in hierarchical matrix computations. Numerical experiments demonstrate the high performance of the algorithms and their effectiveness in generating hierarchical matrices to a desired target accuracy.
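
The first phase described in this abstract, low rank approximation through randomized matrix-vector sampling, can be sketched for a single symmetric block as follows. This is a generic randomized range finder written in plain Python. It is not the batched GPU algorithm of the paper, it builds no hierarchy, and the function names, sample count, and drop threshold are assumptions chosen for illustration.

```python
import math
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def make_matvec(A):
    """Wrap a dense matrix so the algorithm only ever sees x -> A x."""
    def matvec(x):
        return [dot(row, x) for row in A]
    return matvec

def sample_range(matvec, n, num_samples, seed=0):
    """Orthonormal basis for the range of A, built from products A @ omega."""
    rng = random.Random(seed)
    basis = []
    for _ in range(num_samples):
        omega = [rng.gauss(0.0, 1.0) for _ in range(n)]
        y = matvec(omega)
        norm0 = math.sqrt(dot(y, y))
        # Two passes of modified Gram-Schmidt against the current basis.
        for _pass in range(2):
            for q in basis:
                c = dot(q, y)
                y = [yi - c * qi for yi, qi in zip(y, q)]
        norm = math.sqrt(dot(y, y))
        # Keep the sample only if it adds a numerically new direction.
        if norm > 1e-10 * norm0:
            basis.append([yi / norm for yi in y])
    return basis

def symmetric_lowrank(matvec, n, num_samples, seed=0):
    """Return (Q, AQ) such that A is approximated by Q (A Q)^T.

    For symmetric A, Q Q^T A = Q (A Q)^T, so the second factor also needs
    only matrix-vector products.
    """
    Q = sample_range(matvec, n, num_samples, seed)
    AQ = [matvec(q) for q in Q]
    return Q, AQ

# Test matrix: symmetric with exact rank 2 (an assumption for the demo).
n = 30
rng = random.Random(1)
u = [rng.gauss(0.0, 1.0) for _ in range(n)]
v = [rng.gauss(0.0, 1.0) for _ in range(n)]
A = [[u[i] * u[j] + 0.5 * v[i] * v[j] for j in range(n)] for i in range(n)]

Q, AQ = symmetric_lowrank(make_matvec(A), n, num_samples=5)
A_approx = [[sum(q[i] * aq[j] for q, aq in zip(Q, AQ)) for j in range(n)]
            for i in range(n)]
err = math.sqrt(sum((A[i][j] - A_approx[i][j]) ** 2
                    for i in range(n) for j in range(n)))
rel_err = err / math.sqrt(sum(x * x for row in A for x in row))
print(len(Q), rel_err)
```

Applying the matrix to several random vectors at once, as a block, is what the abstract refers to when it mentions increasing arithmetic intensity; the loop above does one vector at a time only for readability.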

Neural Nets with a Newton Conjugate Gradient Method on Multiple GPUs

Reiz; Neckel; Bungartz

2023

Lecture Notes in Computer Science

Training deep neural networks consumes increasing computational resource shares in many compute centers. Often, a brute force approach to obtain hyperparameter values is employed. Our goal is (1) to enhance this by enabling second-order optimization methods with fewer hyperparameters for large-scale neural networks and (2) to compare optimizers for specific tasks to suggest the best one to users for their problem. We introduce a novel second-order optimization method that requires the effect of the Hessian on a vector only and avoids the huge cost of explicitly setting up the Hessian for large-scale networks. We compare the proposed second-order method with two state-of-the-art optimizers on five representative neural network problems, including regression and very deep networks from computer vision or variational autoencoders. For the largest setup, we efficiently parallelized the optimizers with Horovod and applied it to an 8 GPU NVIDIA A100 (DGX-1) machine with 80% parallel efficiency.
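
The idea of using only the effect of the Hessian on a vector can be sketched with a generic Hessian-free Newton conjugate gradient loop. This is not the optimizer proposed in the paper: the toy objective, the finite-difference Hessian-vector product (automatic differentiation would normally be used), the backtracking line search, and all names below are assumptions made for illustration.

```python
import math

# Toy convex objective (assumed): f(w) = 0.5 w^T A w - b^T w + 0.25 sum(w_i^4)
A = [[4.0, 1.0, 0.0],
     [1.0, 3.0, 1.0],
     [0.0, 1.0, 2.0]]
b = [1.0, 2.0, 3.0]

def dot(u, v):
    return sum(x * y for x, y in zip(u, v))

def loss(w):
    Aw = [dot(row, w) for row in A]
    return 0.5 * dot(w, Aw) - dot(b, w) + 0.25 * sum(x ** 4 for x in w)

def grad(w):
    Aw = [dot(row, w) for row in A]
    return [Aw[i] - b[i] + w[i] ** 3 for i in range(len(w))]

def hessian_vector_product(w, v, h=1e-6):
    """Approximate H(w) v by a forward difference of gradients.

    The Hessian itself is never formed; only two gradient calls are needed.
    """
    norm_v = math.sqrt(dot(v, v))
    if norm_v == 0.0:
        return [0.0] * len(v)
    step = h / norm_v
    g0 = grad(w)
    g1 = grad([wi + step * vi for wi, vi in zip(w, v)])
    return [(a - c) / step for a, c in zip(g1, g0)]

def conjugate_gradient(hvp, rhs, tol=1e-6, max_iter=50):
    """Solve H x = rhs using only products with H."""
    x = [0.0] * len(rhs)
    r = list(rhs)
    d = list(rhs)
    rr = dot(r, r)
    stop = tol * tol * rr
    for _ in range(max_iter):
        if rr <= stop:
            break
        Hd = hvp(d)
        curvature = dot(d, Hd)
        if curvature <= 0.0:   # give up on non-positive curvature
            break
        alpha = rr / curvature
        x = [xi + alpha * di for xi, di in zip(x, d)]
        r = [ri - alpha * hi for ri, hi in zip(r, Hd)]
        rr_new = dot(r, r)
        d = [ri + (rr_new / rr) * di for ri, di in zip(r, d)]
        rr = rr_new
    return x

def newton_cg(w, tol=1e-6, max_iter=20):
    for _ in range(max_iter):
        g = grad(w)
        if math.sqrt(dot(g, g)) < tol:
            break
        # Newton direction: solve H p = -g with matrix-free CG.
        p = conjugate_gradient(lambda v: hessian_vector_product(w, v),
                               [-gi for gi in g])
        # Backtracking (Armijo) line search on the loss.
        t, f0, slope = 1.0, loss(w), dot(g, p)
        while t > 1e-8 and loss([wi + t * pi for wi, pi in zip(w, p)]) > f0 + 1e-4 * t * slope:
            t *= 0.5
        w = [wi + t * pi for wi, pi in zip(w, p)]
    return w

w_star = newton_cg([0.0, 0.0, 0.0])
grad_norm = math.sqrt(dot(grad(w_star), grad(w_star)))
print(w_star, grad_norm)
```

Each CG iteration costs one extra gradient evaluation, so memory stays proportional to the number of parameters instead of its square.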
