2014
DOI: 10.1016/j.parco.2014.03.001

Adaptive block size for dense QR factorization in hybrid CPU–GPU systems via statistical modeling

Cited by 9 publications (3 citation statements)
References 10 publications

“…Heterogeneous CPU-accelerator systems have recently been widely used in high-performance computing and cloud computing because of their high performance and low power consumption. Many works have focused on utilizing both CPUs and accelerators to accelerate specific applications, such as matrix multiplication [1], sparse matrix-vector multiplication [2], QR factorization [3], Cholesky factorization [4], the branch-and-bound algorithm [5], the Smith-Waterman algorithm [6], the subset-sum problem [7], particle swarm optimization [8], graph processing [9], range queries [10], computational fluid dynamics [11], and atmospheric numerical simulation [12]. These works demonstrate that CPU-accelerator co-processing yields better performance than CPU-only or accelerator-only execution.…”
Section: Introduction
Citation type: mentioning
confidence: 99%
“…CPU-GPU cooperative computing has recently attracted the attention of many researchers and application developers. Some applications have been reported to successfully implement CPU-GPU cooperative computing, instead of CPU-only or GPU-only computing, such as matrix multiplication, fast Fourier transformation, LU factorization, QR factorization, unsymmetric sparse linear systems, radiation physics, molecular dynamics, the conjugate gradient method, divide-and-conquer algorithms, and branch-and-bound algorithms. These works show that CPU-GPU cooperative computing delivers much better performance than CPU-only or GPU-only computing.…”
Section: Introduction
Citation type: mentioning
confidence: 99%
“…Automatic performance tuning of matrix libraries has been studied from various aspects. There are approaches based on exhaustive search [15,16], incremental parameter sampling [17], statistical models [18,19], and machine learning [20,21], to mention a few. Among them, the approach of ATMathCoreLib is unique in that it is targeted at the finite-horizon problem; it is designed to finish auto-tuning in a specified number of executions and minimize the total execution time.…”
Citation type: mentioning
confidence: 99%
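
To make the finite-horizon idea in the last statement concrete, here is a minimal sketch of how such a tuner could work: given a fixed budget of executions and a set of candidate block sizes, it keeps simple per-candidate statistics (sample count and running mean of measured time) and trades exploration against exploitation so that the total accumulated time stays low. The selection rule, the function names (`finite_horizon_autotune`, `run`), and the toy timings below are illustrative assumptions, not ATMathCoreLib's actual algorithm or the cited paper's statistical model.

```python
import math
import random

def finite_horizon_autotune(run, candidates, budget):
    """Choose a candidate (e.g., a block size) for each of `budget`
    executions so that the total accumulated run time stays low.

    `run(c)` executes the real workload with candidate `c` and returns
    its measured time.  Per-candidate statistics (sample count and
    running mean) drive an optimistic selection rule: candidates with
    few samples get an exploration bonus that decays as the remaining
    budget shrinks.
    """
    stats = {c: {"n": 0, "mean": 0.0} for c in candidates}
    total_time = 0.0
    for step in range(budget):
        # Sample every candidate once before trusting the model.
        untried = [c for c in candidates if stats[c]["n"] == 0]
        if untried:
            choice = untried[0]
        else:
            remaining = budget - step
            def score(c):
                # Lower confidence bound on time: a small mean (fast) or
                # a small sample count (uncertain) both lower the score.
                s = stats[c]
                bonus = math.sqrt(math.log(step + 1) / s["n"])
                return s["mean"] - bonus * (remaining / budget)
            choice = min(candidates, key=score)
        t = run(choice)
        total_time += t
        s = stats[choice]
        s["n"] += 1
        s["mean"] += (t - s["mean"]) / s["n"]  # incremental mean update
    best = min(candidates, key=lambda c: stats[c]["mean"])
    return best, total_time

# Toy usage: block size 128 is fastest here, with noisy timings.
if __name__ == "__main__":
    true_time = {64: 1.3, 128: 1.0, 256: 1.1, 512: 1.5}
    def run(block_size):
        return true_time[block_size] + random.uniform(-0.05, 0.05)
    best, total = finite_horizon_autotune(run, [64, 128, 256, 512], budget=30)
    print(f"best block size: {best}, total time: {total:.2f}")
```

The decaying exploration bonus is what makes the horizon finite in effect: early on it pays to sample uncertain candidates, but as the remaining budget shrinks, the tuner shifts toward the candidate with the best observed mean, keeping the total execution time over the fixed number of runs low.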