An optimized tensor completion library for multiple GPUs

Dun, Ming; Li, Yunchun; Yang, Hailong; Sun, Qingxiao; Luan, Zhongzhi; Qian, Depei

doi:10.1145/3447818.3460692

Cited by 3 publications

(12 citation statements)

References 37 publications


“…Optimizing sparse MTTKRP has been the subject of several prior studies, which propose sparse tensor formats along with parallel algorithms to process/analyze the data. List-based formats, such as F-COO [30], GenTen [39], and TB-COO [12], explicitly store the multi-dimensional coordinates of each non-zero element. To reduce atomic operations, these formats store multiple mode-specific copies of the tensor and/or extra scheduling information, which substantially increases their memory footprint.…”

Section: Related Work (mentioning)

confidence: 99%
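To make the cost of list-based formats concrete, the following is a minimal sketch of MTTKRP over a plain coordinate (COO) list, written sequentially in Python. It is not the F-COO, GenTen, or TB-COO implementation; the function name `mttkrp_coo` and the tiny 2x2x2 tensor are hypothetical and chosen only for illustration. The accumulation into the output row is the update that would require an atomic operation when several GPU threads hold non-zeros sharing the same target-mode index.

```python
def mttkrp_coo(coords, values, factors, mode, dim_size):
    """Sequential MTTKRP over a COO tensor (illustrative sketch only).

    coords   : list of index tuples, one per non-zero
    values   : list of non-zero values
    factors  : one factor matrix per mode, each a list of rows
    mode     : target mode whose output matrix is computed
    dim_size : length of the tensor along the target mode
    """
    rank = len(factors[0][0])
    out = [[0.0] * rank for _ in range(dim_size)]
    for idx, val in zip(coords, values):
        row = out[idx[mode]]
        for r in range(rank):
            prod = val
            for m, i in enumerate(idx):
                if m != mode:
                    prod *= factors[m][i][r]
            # On a GPU, non-zeros with the same idx[mode] may be handled by
            # different threads, so this update would need an atomic add.
            row[r] += prod
    return out


# Hypothetical 2x2x2 tensor with three non-zeros and rank-2 factors.
coords = [(0, 0, 0), (0, 1, 1), (1, 0, 1)]
values = [1.0, 2.0, 3.0]
A = [[1.0, 1.0], [1.0, 1.0]]
B = [[1.0, 2.0], [3.0, 4.0]]
C = [[5.0, 6.0], [7.0, 8.0]]
result = mttkrp_coo(coords, values, [A, B, C], mode=0, dim_size=2)
print(result)  # [[47.0, 76.0], [21.0, 48.0]]
```

The first two non-zeros both update output row 0, which is the kind of conflict that mode-specific copies or scheduling information are meant to avoid.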

“…Segmented scan and reduction [42,52] have been used to reduce the synchronization cost of sparse workloads [6,12,29,51,53]. Prior studies apply these primitives to mode-specific formats with delineated and/or sorted groups of non-zero elements according to the target mode.…”

Section: Related Work (mentioning)

confidence: 99%
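The idea of segmented reduction over sorted groups of non-zeros can be sketched as follows. This is a sequential Python illustration under the assumption that partial products are already sorted by target-mode index and that a start-of-segment flag marks each group; the names `segmented_reduce`, `rows`, and `partials` are hypothetical, and the cited works use parallel GPU primitives rather than this loop.

```python
def segmented_reduce(flags, partials):
    """Sum `partials` within each segment.

    flags[i] is True where a new segment (a new target-mode index) starts.
    Returns one sum per segment, in order.
    """
    sums = []
    for flag, p in zip(flags, partials):
        if flag or not sums:
            sums.append(p)       # open a new segment
        else:
            sums[-1] += p        # accumulate locally, no shared-memory write
    return sums


# Hypothetical partial products, sorted by their output row index.
rows = [0, 0, 0, 2, 2, 5]
partials = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
flags = [i == 0 or rows[i] != rows[i - 1] for i in range(len(rows))]
seg_sums = segmented_reduce(flags, partials)
print(seg_sums)  # [6.0, 9.0, 6.0]
```

With this grouping, each output row receives one write per segment instead of one synchronized update per non-zero, which is the synchronization saving the statement refers to.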

“…The state-of-the-art tensor formats for massively parallel GPUs use list-based [12,29,39] or tree-based [34,36] data structures to store high-dimensional sparse data in a mode-specific form. In this section, we provide an overview of the flagged coordinate (F-COO) and MM-CSF formats, which are representative of the two main format categories for GPU architectures.…”

Section: Sparse Tensor Formats for GPUs (mentioning)

confidence: 99%

“…TD algorithms for high-dimensional sparse data are challenging to execute on emerging parallel architectures due to their low arithmetic intensity, irregular memory access, workload imbalance, and synchronization overhead [10,17]. To improve the performance of these memory-bound workloads, recent studies [12,29,34,36,39] exploit massively parallel architectures equipped with High Bandwidth Memory (HBM), namely GPUs, to accelerate the MTTKRP kernel. While such accelerators deliver memory bandwidth exceeding 2 TB/s [1], they suffer from limited memory capacity and high memory-access latency, which is in the order of hundreds of processor cycles [20,32].…”

Section: Introduction (mentioning)

confidence: 99%

“…However, these strategies result in formats that are mode-specific, where nonzero elements are organized/accessed according to a specific mode (i.e., dimension) orientation. Mode-specific formats typically require keeping multiple tensor copies [12,29,36] and/or extra mapping and scheduling information (e.g., flags or arrays) about groups of data-dependent non-zero elements for every mode [12,29,39], which can significantly increase their overall memory footprint.…”

Section: Introduction (mentioning)

confidence: 99%


Efficient, Out-of-Memory Sparse MTTKRP on Massively Parallel Architectures

Nguyen, Helal, Checconi et al. 2022

Preprint


Tensor decomposition (TD) is an important method for extracting latent information from high-dimensional (multi-modal) sparse data. This study presents a novel framework for accelerating fundamental TD operations on massively parallel GPU architectures. In contrast to prior work, the proposed Blocked Linearized CoOrdinate (BLCO) format enables efficient out-of-memory computation of tensor algorithms using a unified implementation that works on a single tensor copy. Our adaptive blocking and linearization strategies not only meet the resource constraints of GPU devices, but also accelerate data indexing, eliminate control-flow and memory-access irregularities, and reduce kernel launching overhead. To address the substantial synchronization cost on GPUs, we introduce an opportunistic conflict resolution algorithm that discovers and resolves conflicting updates across threads on-the-fly, without keeping any auxiliary information or storing non-zero elements in specific mode orientations. As a result, our framework delivers superior in-memory performance compared to prior state-of-the-art, and is the only framework capable of processing out-of-memory tensors. On the latest Intel and NVIDIA GPUs, BLCO achieves 2.12 − 2.6× geometric-mean speedup (with up to 33.35× speedup) over the state-of-the-art mixed-mode compressed sparse fiber (MM-CSF) on a range of real-world sparse tensors.
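As a rough illustration of what linearizing coordinates can mean, the sketch below packs a multi-dimensional index into a single integer by concatenating fixed-width bit fields, one per mode. This is an assumption made for illustration and is not the BLCO encoding, whose exact layout and blocking strategy the abstract does not specify; the helper names `bits_needed`, `linearize`, and `delinearize` are hypothetical.

```python
def bits_needed(dim):
    """Number of bits required to store indices in the range [0, dim)."""
    return max(1, (dim - 1).bit_length())


def linearize(coord, dims):
    """Pack one index per mode into a single integer key (first mode highest)."""
    key = 0
    for c, d in zip(coord, dims):
        key = (key << bits_needed(d)) | c
    return key


def delinearize(key, dims):
    """Recover the per-mode indices from a key produced by `linearize`."""
    coord = []
    for d in reversed(dims):
        b = bits_needed(d)
        coord.append(key & ((1 << b) - 1))  # mask out the lowest field
        key >>= b
    return tuple(reversed(coord))


# Hypothetical 4 x 8 x 16 tensor: 2 + 3 + 4 = 9 bits per non-zero index.
dims = (4, 8, 16)
key = linearize((3, 5, 9), dims)
print(key)                     # 473
print(delinearize(key, dims))  # (3, 5, 9)
```

Because any mode's index can be extracted from the key with a shift and a mask, a single copy of the keys can serve every mode, which is consistent with the abstract's claim of working on a single tensor copy.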



ScalFrag: Efficient Tiled-MTTKRP with Adaptive Launching on GPUs

Lin, Wang, Deng et al. 2024

2024 IEEE International Conference on Cluster Computing (CLUSTER)


Efficient, out-of-memory sparse MTTKRP on massively parallel architectures

Nguyen, Helal, Checconi et al. 2022

Proceedings of the 36th ACM International Conference on Supercomputing


Tensor decomposition (TD) is an important method for extracting latent information from high-dimensional (multi-modal) sparse data. This study presents a novel framework for accelerating fundamental TD operations on massively parallel GPU architectures. In contrast to prior work, the proposed Blocked Linearized CoOrdinate (BLCO) format enables efficient out-of-memory computation of tensor algorithms using a unified implementation that works on a single tensor copy. Our adaptive blocking and linearization strategies not only meet the resource constraints of GPU devices, but also accelerate data indexing, eliminate control-flow and memory-access irregularities, and reduce kernel launching overhead. To address the substantial synchronization cost on GPUs, we introduce an opportunistic conflict resolution algorithm, in which threads collaborate instead of contending on memory access to discover and resolve their conflicting updates on-the-fly, without keeping any auxiliary information or storing non-zero elements in specific mode orientations. As a result, our framework delivers superior in-memory performance compared to prior state-of-the-art, and is the only framework capable of processing out-of-memory tensors. On the latest Intel and NVIDIA GPUs, BLCO achieves 2.12 − 2.6× geometric-mean speedup (with up to 33.35× speedup) over the state-of-the-art mixed-mode compressed sparse fiber (MM-CSF) on a range of real-world sparse tensors.

CCS CONCEPTS • Mathematics of computing → Mathematical software performance; • Computing methodologies → Massively parallel algorithms.
