2021
DOI: 10.1016/j.jpdc.2021.02.013
TSM2X: High-performance tall-and-skinny matrix–matrix multiplication on GPUs

Cited by 19 publications (9 citation statements)
References 10 publications
“…The challenge emerges when any one of the dimensions is small relative to the other two; in this case, the operational intensity approaches O(1), requiring highly efficient data movement to avoid becoming memory-bound. Such "tall-and-skinny" matrices are difficult to process efficiently on GPUs [24]. While operational intensity can sometimes be addressed by processing multiple inputs simultaneously via batching, this may not be an option for latency-sensitive inference operations where input must be processed as soon as it is received.…”
Section: B. Workload Taxonomy
Citation type: mentioning
confidence: 99%
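
As a worked check on the O(1) claim in the excerpt above (a sketch under assumed symbols: A is m x k, B is k x n, and s is the number of bytes per matrix element; none of these names come from the excerpt), the operational intensity of one tall-and-skinny multiply C = AB is roughly

\[
I(m,n,k) \;=\; \frac{2mnk}{s\,(mk + kn + mn)}
\;\approx\; \frac{2mnk}{s\,mk} \;=\; \frac{2n}{s}
\qquad (n \ll m \text{ and } n \ll k),
\]

so with n = 2 in double precision (s = 8 bytes) the intensity is about 0.5 flop/byte, typically far below the compute-to-bandwidth ratio of current GPUs, which is why such multiplies stay memory-bound unless data movement is kept close to the lower bound.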
“…One challenge in designing the algorithm is to simultaneously maximize coalesced global memory access, minimize bank conflicts when accessing shared memory, and minimize thread divergence. We use a dynamic data-thread assignment strategy [17][18][19][20][21] to optimize both the access and computation of coefficients.…”
Section: Iterative Processing Kernel (IPK)
Citation type: mentioning
confidence: 99%
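
To make the coalescing constraint in that excerpt concrete, here is a minimal CUDA sketch, not the TSM2X kernel: it only shows why a one-row-per-thread mapping over a column-major tall matrix yields coalesced global loads. The kernel name, the row-sum computation, and the launch shape are illustrative assumptions.

// Minimal illustrative sketch (not the TSM2X kernel). A is a column-major
// m x k matrix with m >> k; one row is assigned to each thread, so the
// threads of a warp read consecutive addresses within every column.
__global__ void coalesced_row_sum(const double *A, double *out, int m, int k)
{
    // Adjacent threads handle adjacent rows.
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= m) return;

    double acc = 0.0;
    for (int j = 0; j < k; ++j) {
        // Column-major indexing: element (row, j) lives at A[j * m + row].
        // Threads row, row+1, ... touch addresses one element apart, so each
        // warp's load coalesces into a few wide memory transactions.
        acc += A[(size_t)j * m + row];
    }
    out[row] = acc;  // One write per thread, also coalesced.
}

A typical launch would be coalesced_row_sum<<<(m + 255) / 256, 256>>>(dA, dOut, m, k). The dynamic data-thread assignment mentioned in the excerpt goes further and adapts how many elements each thread owns to the input sizes; this sketch does not attempt that.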
“…As the type of processor that contributes most of the computing parallelism in many current and future HPC systems, Graphics Processing Units (GPUs), equipped with thousands of low-power cores, offer high computational power and energy efficiency. Many applications and libraries have been designed and optimized for GPU accelerators [1,3,8,9,13,25,34,36,42,43]. Because GPUs are designed for highly parallelizable computations while CPUs are more efficient at serial computations, CPUs and GPUs linked through fast interconnects [30,31] are usually combined into heterogeneous systems that can efficiently handle a large spectrum of scientific computing workloads.…”
Section: Introduction
Citation type: mentioning
confidence: 99%