Distributed Out-of-Memory SVD on CPU/GPU Architectures

Boureima, Ismael; Bhattarai, Manish; Eren, Maksim E.; Solovyev, Nick; Djidjev, Hristo; Alexandrov, Boian S.

doi:10.1109/hpec55821.2022.9926288

Cited by 4 publications

(2 citation statements)

References 16 publications


“…When performing NMF on GPUs, OOM situations can arise in various scenarios with different degrees of complexity. As discussed in [34], we distinguish three main types of OOM scenarios. Scenarios of type 0 (OOM-0) concern practical problems where the input data A and its co-factors W and H can easily be stored on GPU memory.…”

Section: Rationale For An Algorithm For the Out-of-memory Distributed... (mentioning)

confidence: 99%

Distributed Out-of-Memory NMF on CPU/GPU Architectures

Boureima, Bhattarai, Eren et al. 2023

Preprint

Self Cite


We propose an efficient distributed out-of-memory implementation of the Non-negative Matrix Factorization (NMF) algorithm for heterogeneous high-performance-computing (HPC) systems. The proposed implementation is based on prior work on NMFk, which can perform automatic model selection and extract latent variables and patterns from data. In this work, we extend NMFk by adding support for dense and sparse matrix operations on multi-node, multi-GPU systems. The resulting algorithm is optimized for out-of-memory (OOM) problems where the memory required to factorize a given matrix is greater than the available GPU memory. Memory complexity is reduced by batching/tiling strategies, and sparse and dense matrix operations are significantly accelerated with GPU cores (or tensor cores when available). Input/Output (I/O) latency associated with batch copies between host and device is hidden using CUDA streams to overlap data transfers and compute asynchronously, and latency associated with collective communications (both intra-node and inter-node) is reduced using optimized NVIDIA Collective Communication Library (NCCL) based communicators. Benchmark results show significant improvement, from 32x to 76x speedup, with the new implementation using GPUs over the CPU-based NMFk. Good weak scaling was demonstrated on up to 4096 multi-GPU cluster nodes with approximately 25,000 GPUs when decomposing a dense 340 Terabyte-size matrix and an 11 Exabyte-size sparse matrix of density 10^{-6}.
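The batching/tiling idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes row-wise tiles of A, the standard Frobenius-norm multiplicative updates, and NumPy on the host in place of GPU kernels, CUDA streams, or NCCL. The function name `batched_nmf_step` and its parameters are hypothetical. The point it shows is that only one tile of A needs to be resident at a time, because the terms W^T A and W^T W are sums over tiles.

```python
import numpy as np

def batched_nmf_step(A_batches, W_batches, H, eps=1e-12):
    """One multiplicative-update sweep that touches one row tile of A at a time.

    Hypothetical helper for illustration only (assumed row-wise tiling,
    Frobenius-norm multiplicative updates).
    """
    k = H.shape[0]
    HHt = H @ H.T                  # small k x k matrix, shared by all tiles
    WtA = np.zeros_like(H)         # accumulates sum_b W_b^T A_b  (k x n)
    WtW = np.zeros((k, k))         # accumulates sum_b W_b^T W_b  (k x k)
    new_W = []
    for A_b, W_b in zip(A_batches, W_batches):
        # Update the tile of W that corresponds to this tile of A.
        W_b = W_b * (A_b @ H.T) / (W_b @ HHt + eps)
        # Accumulate the tile's contribution to the H update.
        WtA += W_b.T @ A_b
        WtW += W_b.T @ W_b
        new_W.append(W_b)
    # H is updated once per sweep from the accumulated sums.
    H = H * WtA / (WtW @ H + eps)
    return new_W, H

# Small usage example with random non-negative data (sizes are arbitrary).
rng = np.random.default_rng(0)
A = rng.random((12, 8))
W0 = rng.random((12, 3))
H0 = rng.random((3, 8))
new_W, H1 = batched_nmf_step(np.array_split(A, 3), np.array_split(W0, 3), H0)
```

In this sketch the per-tile work is independent apart from the two small accumulators, which is what would make it possible to stream tiles through a device with limited memory.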


“…However, each rank factorization is independent from one another. Therefore, future work can consider parallelization of this task, or a distributed version of this task utilizing High-Performance Computing (HPC) environments [12,13,16,17].…”

Section: Future Work (mentioning)

confidence: 99%

Semi-Supervised Classification of Malware Families Under Extreme Class Imbalance via Hierarchical Non-Negative Matrix Factorization with Automatic Model Selection

Eren, Bhattarai, Joyce et al. 2023

ACM Trans. Priv. Secur.


Identification of the family to which a malware specimen belongs is essential in understanding the behavior of the malware and developing mitigation strategies. Solutions proposed by prior work, however, are often not practicable due to the lack of realistic evaluation factors. These factors include learning under class imbalance, the ability to identify new malware, and the cost of production-quality labeled data. In practice, deployed models face prominent, rare, and new malware families. At the same time, obtaining a large quantity of up-to-date labeled malware for training a model can be expensive. In this paper, we address these problems and propose a novel hierarchical semi-supervised algorithm, which we call the HNMFk Classifier, that can be used in the early stages of the malware family labeling process. Our method is based on non-negative matrix factorization with automatic model selection, that is, with an estimation of the number of clusters. With HNMFk Classifier, we exploit the hierarchical structure of the malware data together with a semi-supervised setup, which enables us to classify malware families under conditions of extreme class imbalance. Our solution can perform abstaining predictions, or rejection option, which yields promising results in the identification of novel malware families and helps with maintaining the performance of the model when a low quantity of labeled data is used. We perform bulk classification of nearly 2,900 both rare and prominent malware families, through static analysis, using nearly 388,000 samples from the EMBER-2018 corpus. In our experiments, we surpass both supervised and semi-supervised baseline models with an F1 score of 0.80.


Distributed out-of-memory NMF on CPU/GPU architectures

Boureima, Bhattarai, Eren et al. 2023

J Supercomput


We propose an efficient distributed out-of-memory implementation of the non-negative matrix factorization (NMF) algorithm for heterogeneous high-performance-computing systems. The proposed implementation is based on prior work on NMFk, which can perform automatic model selection and extract latent variables and patterns from data. In this work, we extend NMFk by adding support for dense and sparse matrix operation on multi-node, multi-GPU systems. The resulting algorithm is optimized for out-of-memory problems where the memory required to factorize a given matrix is greater than the available GPU memory. Memory complexity is reduced by batching/tiling strategies, and sparse and dense matrix operations are significantly accelerated with GPU cores (or tensor cores when available). Input/output latency associated with batch copies between host and device is hidden using CUDA streams to overlap data transfers and compute asynchronously, and latency associated with collective communications (both intra-node and inter-node) is reduced using optimized NVIDIA Collective Communication Library (NCCL) based communicators. Benchmark results show significant improvement, from 32x to 76x speedup, with the new implementation using GPUs over the CPU-based NMFk. Good weak scaling was demonstrated on up to 4096 multi-GPU cluster nodes with approximately 25,000 GPUs when decomposing a dense 340 Terabyte-size matrix and an 11 Exabyte-size sparse matrix of density 10^{-6}.
