Auto-Tuning of Thread Assignment for Matrix-Vector Multiplication on GPUs

Wang, Jinwei; Ma, Xiang; Zhu, Yun; Sun, Jizhou

doi:10.1587/transinf.e96.d.2319

Cited by 3 publications

(2 citation statements)

References 17 publications

Supporting

Mentioning

Contrasting

Order By: Relevance

“…Thus, when the matrix is very wide, each thread conducts a large number of calculations and the performance will observably reduce. Therefore, we designed a novel autotuning method for matrix-vector multiplication on GPUs, where the number of threads used to compute one element of the result vector can be autotuned according to the matrix size [ 29 ]. For very wide matrices, thousands of threads are used to compute one element of the result vector.…”

Section: Design Of Parallel Aam Fitting Algorithm For Gpusmentioning

confidence: 99%

Efficient Parallel Implementation of Active Appearance Model Fitting Algorithm on GPU

Wang

Zhu

et al. 2014

The Scientific World Journal

Self Cite

View full text Add to dashboard Cite

The active appearance model (AAM) is one of the most powerful model-based object detecting and tracking methods which has been widely used in various situations. However, the high-dimensional texture representation causes very time-consuming computations, which makes the AAM difficult to apply to real-time systems. The emergence of modern graphics processing units (GPUs) that feature a many-core, fine-grained parallel architecture provides new and promising solutions to overcome the computational challenge. In this paper, we propose an efficient parallel implementation of the AAM fitting algorithm on GPUs. Our design idea is fine grain parallelism in which we distribute the texture data of the AAM, in pixels, to thousands of parallel GPU threads for processing, which makes the algorithm fit better into the GPU architecture. We implement our algorithm using the compute unified device architecture (CUDA) on the Nvidia's GTX 650 GPU, which has the latest Kepler architecture. To compare the performance of our algorithm with different data sizes, we built sixteen face AAM models of different dimensional textures. The experiment results show that our parallel AAM fitting algorithm can achieve real-time performance for videos even on very high-dimensional textures.

show abstract

Section: Design Of Parallel Aam Fitting Algorithm For Gpusmentioning

confidence: 99%

Efficient Parallel Implementation of Active Appearance Model Fitting Algorithm on GPU

Wang

Zhu

et al. 2014

The Scientific World Journal

Self Cite

View full text Add to dashboard Cite

show abstract

“…GPU on the other hand is capable of executing more GFLOPS than normal CPU. It provides a highly parallel computing environment suitable for numerous data parallel arithmetic computations such as dense linear algebraic operations 13 . However, the only restriction in earlier GPU version lies in its lack of support for IEEE FP Standards 12 .…”

Section: Gpu and Cudamentioning

confidence: 99%

Padding Free Bank Conflict Resolution for CUDA-Based Matrix Transpose Algorithm

Khan¹,

Al-Mouhamed²,

Fatayer³

et al. 2014

IJNDC

View full text Add to dashboard Cite

The advances of Graphic Processing Units (GPU) technology and the introduction of CUDA programming model facilitates developing new solutions for sparse and dense linear algebra solvers. Matrix Transpose is an important linear algebra procedure that has deep impact in various computational science and engineering applications. Several factors hinder the expected performance of large matrix transpose on GPU devices. The degradation in performance involves the memory access pattern such as coalesced access in the global memory and bank conflict in the shared memory of streaming multiprocessors within the GPU. In this paper, two matrix transpose algorithms are proposed to alleviate the aforementioned issues of ensuring coalesced access and conflict free bank access. The proposed algorithms have comparable execution times with the NVIDIA SDK bank conflict -free matrix transpose implementation. The main advantage of proposed algorithms is that they eliminate bank conflicts while allocating shared memory exactly equal to the tile size (T x T) of the problem space. However, to the best of our knowledge an extra space of Tx(T+1) needs to be allocated in the published research. We have also applied the proposed transpose algorithm to recursive gaussian implementation of NVIDIA SDK and achieved about 6% improvement in performance.

show abstract

Optimization Techniques for GPU Programming

et al. 2023

View full text Add to dashboard Cite

In the past decade, Graphics Processing Units have played an important role in the field of high-performance computing and they still advance new fields such as IoT, autonomous vehicles, and exascale computing. It is therefore important to understand how to extract performance from these processors, something that is not trivial. This survey discusses various optimization techniques found in 450 papers published in the last 14 years. We analyze the optimizations from different perspectives which shows that the various optimizations are highly interrelated, explaining the need for techniques such as auto-tuning.

show abstract

Auto-Tuning of Thread Assignment for Matrix-Vector Multiplication on GPUs

Cited by 3 publications

References 17 publications

Efficient Parallel Implementation of Active Appearance Model Fitting Algorithm on GPU

Efficient Parallel Implementation of Active Appearance Model Fitting Algorithm on GPU

Padding Free Bank Conflict Resolution for CUDA-Based Matrix Transpose Algorithm

Optimization Techniques for GPU Programming

Contact Info

Product

Resources

About