In-place transposition of rectangular matrices on accelerators

Sung, I-Jui; Gómez-Luna, Juan; González-Linares, José María; Guil, Nicolás; Hwu, Wen-mei W.

doi:10.1145/2692916.2555266

Cited by 12 publications

(18 citation statements)

References 18 publications


“…However, these papers do not target TLB related issues which arise when large data structures are processed by the memory bound algorithms. Matrix transposition solutions presented in [3,24,26] include many GPU specific optimisations, yet they also do not consider the impact of the TLB on algorithm performance.…”

Section: Prior Art (mentioning)

confidence: 99%

Efficient Processing of Large Data Structures on GPUs: Enumeration Scheme Based Optimisation

Gorawski

Lorek

2017

Int J Parallel Prog


The purpose of this paper is to highlight the performance issues of matrix transposition algorithms for large matrices that relate to the Translation Lookaside Buffer (TLB) cache. Existing optimisation techniques such as coalesced access and the use of shared memory, however necessary and beneficial, are not sufficient to neutralise the problem. As the data problem size increases, these optimisations do not exploit data locality effectively enough to counteract the detrimental effects of TLB cache misses. We propose a new optimisation technique that counteracts the performance degradation of these algorithms and seamlessly complements current optimisations. Our optimisation is based on detailed analysis of enumeration schemes that can be applied to either individual matrix entries or blocks (sub-matrices). The key advantage of these enumeration schemes is that they do not incur matrix storage format conversion because they operate on canonical matrix layouts. In addition, several cache-efficient matrix transposition algorithms based on enumeration schemes are offered: an improved version of the in-place algorithm for square matrices, an out-of-place algorithm for rectangular matrices, and two 3D involutions. We demonstrate that the choice of the enumeration schemes and their parametrisation can have a direct and significant impact on the algorithm's memory access pattern. Our in-place version of the algorithm delivers up to 100% performance improvement over the existing optimisation techniques. Meanwhile, for the out-of-place version we observe up to 300% performance gain over NVidia's algorithm. We also offer improved versions of two involution transpositions for 3D matrices that can achieve a performance increase of up to 300%. To the best of our knowledge, this is the first effective attempt to control the logical-to-physical block association through the design of enumeration schemes in the context of matrix transposition.
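The idea that a block enumeration scheme changes the memory access order without changing the canonical layout can be sketched roughly as follows. This is only an illustration in plain Python, not the scheme from the paper: the function names `row_major_blocks`, `diagonal_blocks` and `transpose_blocked`, the `tile` parameter, and the wrapped-diagonal ordering are all assumptions made for the example, and CPU-side Python does not reproduce any TLB behaviour.

```python
def row_major_blocks(block_rows, block_cols):
    """Hypothetical enumeration: visit blocks row by row."""
    for i in range(block_rows):
        for j in range(block_cols):
            yield i, j


def diagonal_blocks(block_rows, block_cols):
    """Hypothetical enumeration: visit blocks along wrapped diagonals."""
    for d in range(block_cols):
        for i in range(block_rows):
            yield i, (i + d) % block_cols


def transpose_blocked(src, rows, cols, tile, enumerate_blocks=row_major_blocks):
    """Out-of-place transpose of a rows x cols row-major matrix held in a flat list.

    The matrix stays in its canonical row-major layout; only the order in
    which tile x tile blocks are visited depends on `enumerate_blocks`.
    """
    dst = [None] * (rows * cols)
    block_rows = (rows + tile - 1) // tile  # ceiling division
    block_cols = (cols + tile - 1) // tile
    for bi, bj in enumerate_blocks(block_rows, block_cols):
        for r in range(bi * tile, min((bi + 1) * tile, rows)):
            for c in range(bj * tile, min((bj + 1) * tile, cols)):
                # Element (r, c) of the source becomes element (c, r)
                # of the cols x rows result.
                dst[c * rows + r] = src[r * cols + c]
    return dst
```

Both enumerations produce the same result; they differ only in the sequence of blocks touched, which is the property an enumeration-based optimisation would tune.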


“…It allows the rows and columns to be operated on independently, reducing work complexity and auxiliary space. Catanzaro et al. compare their implementation to our original 3-stage approach [10], which is improved in the present work. Fig.…”

Section: In-place and Out-of-place Transposition For GPUs (mentioning)

confidence: 99%

“…In [10] we showed that the 4-stage approach presents some issues that limit its throughput on GPUs. For instance, the transposition 1000 !…”

Section: Full Transposition As a Sequence Of Elementary Tiled Transpo… (mentioning)

confidence: 99%

“…We presented in [10] a 4-stage approach, based on a method for multicore CPUs [11], and a 3-stage approach that improved spatial locality. Both approaches used elementary tile-wise transpositions for each stage.…”

Section: Introduction (mentioning)

confidence: 99%

“…We compare them, and detect which of them can be advantageously used for particular matrix characteristics. • We optimize the elementary transpositions, which were employed by our original 3-stage tile-wise transposition [10], and adapt them to work with 32-bit and 64-bit elements. We also perform exhaustive tests to identify the best execution configuration for each tile size.…”

Section: Introduction (mentioning)

confidence: 99%


In-Place Matrix Transposition on GPUs

Gómez-Luna

Sung

Chang

et al. 2016

IEEE Trans. Parallel Distrib. Syst.

Self Cite


Matrix transposition is an important algorithmic building block for many numeric algorithms such as FFT. With more and more algebra libraries offloading to GPUs, a high-performance in-place transposition becomes necessary. Intuitively, in-place transposition should be a good fit for GPU architectures due to limited available on-board memory capacity and high throughput. However, direct application of CPU in-place transposition algorithms lacks the amount of parallelism and locality required by GPUs to achieve good performance. In this paper we present our in-place matrix transposition approach for GPUs that is performed using elementary tile-wise transpositions. We propose low-level optimizations for the elementary transpositions, and find the best performing configurations for them. Then, we compare all sequences of transpositions that achieve full transposition, and detect which is the most favorable for each matrix. We present a heuristic to guide the selection of tile sizes, and compare it to brute-force search. We diagnose the drawback of our approach, and propose a solution using minimal padding. With fast padding and unpadding kernels, the overall throughput is significantly increased. Finally, we compare our method to another recent implementation.


Algorithms for in‐place matrix transposition

Gustavson

Walker

2018

Concurrency and Computation


Summary: This paper presents implementations of in‐place algorithms for transposing rectangular matrices. One implementation is a swap‐based algorithm described by Tretyakov and Tyrtyshnikov [1], to which we have introduced a number of variations. In particular, we show how the original algorithm can be modified to require constant additional memory. A proof of correctness is also sketched. This algorithm is compared with cycle‐following approaches and with the swap‐based GCD Transpose algorithm that partitions the matrix into a hierarchy of square submatrices. The performance of parallel implementations on a multicore system is also investigated.
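For readers unfamiliar with the cycle‐following approach mentioned in this summary, a minimal textbook-style sketch is given below. It is not the implementation from any of the cited papers: the function name `transpose_in_place` is hypothetical, the matrix is assumed to be stored row-major in a flat Python list, and a `visited` array of one flag per element is used for simplicity, so this variant does not have the constant-memory property discussed above.

```python
def transpose_in_place(a, rows, cols):
    """Transpose a rows x cols row-major matrix stored in the flat list `a`.

    After the call, `a` holds the cols x rows transpose in row-major order.
    Illustrative cycle-following sketch; uses O(rows * cols) visited flags.
    """
    n = rows * cols
    if n <= 2:
        return  # at most two elements: layout is unchanged by transposition
    visited = [False] * n
    # Indices 0 and n - 1 are fixed points of the permutation.
    for start in range(1, n - 1):
        if visited[start]:
            continue
        cur = start
        carried = a[start]  # value waiting to be written to its destination
        while True:
            # Element at (r, c), index r * cols + c, must move to
            # index c * rows + r, which equals (index * rows) mod (n - 1).
            dest = (cur * rows) % (n - 1)
            a[dest], carried = carried, a[dest]
            visited[dest] = True
            cur = dest
            if cur == start:
                break
```

A short usage example: for `a = [1, 2, 3, 4, 5, 6]` viewed as a 2 x 3 matrix, calling `transpose_in_place(a, 2, 3)` leaves `a` as `[1, 4, 2, 5, 3, 6]`, the 3 x 2 transpose. Each cycle depends on the previous element in that cycle, which is one way to see why the summary contrasts this approach with swap-based and parallel alternatives.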
