2021
DOI: 10.1145/3432185
PLANC: Parallel Low-rank Approximation with Nonnegativity Constraints

Abstract: We consider the problem of low-rank approximation of massive dense nonnegative tensor data, for example, to discover latent patterns in video and imaging applications. As the size of data sets grows, single workstations are hitting bottlenecks in both computation time and available memory. We propose a distributed-memory parallel computing solution to handle massive data sets, loading the input data across the memories of multiple nodes, and performing efficient and scalable parallel algorithms to compute the …
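The distributed factor-update scheme the abstract describes can be illustrated with a minimal single-process sketch. This is a hypothetical illustration, not PLANC's actual code: it assumes a 1-D row distribution in which each MPI rank owns a block of the data matrix and the matching rows of W, while the small k×k and k×n products needed to update the replicated factor H are summed across ranks with an allreduce. The multiplicative-update rule of Lee and Seung stands in here for PLANC's solvers, and only one rank's local work is shown.

```python
import numpy as np

rng = np.random.default_rng(0)
m_local, n, k = 40, 30, 4           # local rows, columns, target rank
A_local = rng.random((m_local, n))  # this rank's row block of the data

W_local = rng.random((m_local, k))  # rows of W owned by this rank
H = rng.random((k, n))              # H replicated on every rank

def mu_step(A, W, H, eps=1e-9):
    """One multiplicative-update sweep (Lee & Seung) on a local block."""
    # Update W: needs only the replicated H and the local block of A,
    # so no communication is required for this half of the sweep.
    W *= (A @ H.T) / (W @ (H @ H.T) + eps)
    # Update H: in the distributed setting these two products would be
    # summed across ranks (MPI_Allreduce) before the elementwise update;
    # note only k*n and k*k entries move, never the full data matrix.
    WtA = W.T @ A   # k x n  -> allreduce(sum) across ranks in parallel
    WtW = W.T @ W   # k x k  -> allreduce(sum) across ranks in parallel
    H *= WtA / (WtW @ H + eps)
    return W, H

err0 = np.linalg.norm(A_local - W_local @ H)
for _ in range(50):
    W_local, H = mu_step(A_local, W_local, H)
err1 = np.linalg.norm(A_local - W_local @ H)
print(f"reconstruction error: {err0:.3f} -> {err1:.3f}")
```

The design point this sketch makes is the one the abstract relies on: the communicated quantities scale with the rank k, not with the data size, so the full tensor/matrix can stay partitioned across node memories.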

Cited by 9 publications (2 citation statements) | References 46 publications
“…Almost all distributed GPU implementations, including NMF-mGPU [28] and PLANC [33], rely on significant data communication to update the factors. This involves using CUDA-aware MPI primitives for data communication, or MPI distributed-memory offload through NVBLAS [33] without multi-node GPU communicators. Such an implementation incurs high data-movement costs from on-loading/offloading data to/from the device, which significantly raises communication cost relative to computation cost for large data decompositions.…”
Section: Related Work on Distributed NMF
confidence: 99%