2011
DOI: 10.1016/j.procs.2011.04.036
Multi-level Optimization of Matrix Multiplication for GPU-equipped Systems

Cited by 18 publications (11 citation statements)
References 7 publications
“…The MMM code for one core is given either by the Cilk tool [66] or by the cblas_sgemm routine of ATLAS. At last, [67] and [68] [76]. Reference [26] shows how to modify the MAGMA GEMM kernels in order to use the Fermi architecture more efficiently.…”
Section: Related Work
confidence: 99%
“…[75] provides a theoretical analysis of why performance drawbacks appear for specific problem sizes when cache memories are used. Finally, in [76], different data array layouts are evaluated, such as Z-Morton and X-Morton. All the above works are empirical techniques and do not give a methodology.…”
Section: Related Work
confidence: 99%
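The Z-Morton layout mentioned in the statement above linearizes a 2D array by interleaving the bits of the row and column indices, so that nearby elements of a tile stay close in memory. A minimal sketch of computing that index (the function name and bit width are illustrative, not from the cited work):

```python
def morton_index(i, j, bits=16):
    """Z-Morton (Z-order) linear index of element (i, j):
    interleave the bits of the row index i and column index j."""
    idx = 0
    for b in range(bits):
        idx |= ((i >> b) & 1) << (2 * b + 1)  # row bits go to odd positions
        idx |= ((j >> b) & 1) << (2 * b)      # column bits go to even positions
    return idx
```

Traversing elements in increasing `morton_index` order visits the matrix in recursive Z-shaped tiles, which is why such layouts interact well with cache hierarchies.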
“…The whole set of computations can be seen as a 3D cube where element (i, k, j) corresponds to the basic operation a_{i,k} · b_{k,j}. With the notable exception of recently introduced 2.5D schemes [42], all implementations (see [43] for a recent survey), including those implemented with MapReduce [36], [27] or designed for GPUs [44], are based on the ScaLAPACK algorithm [45], which uses the outer product described in Section IV-A as its building block. For the sake of simplicity, we will concentrate on the case of square matrices only.…”
Section: B. 3D Data Distribution: Matrix Multiplication
confidence: 99%
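The outer-product building block described in the statement above accumulates the product as a sum of rank-1 updates: for each k, the column A[:, k] is combined with the row B[k, :]. A minimal sketch for square matrices (plain Python lists; the function name is illustrative, not from the cited work):

```python
def matmul_outer_product(A, B):
    """Compute C = A @ B as a sum of rank-1 (outer-product) updates,
    one per index k, for square n x n matrices."""
    n = len(A)
    C = [[0.0] * n for _ in range(n)]
    for k in range(n):                     # one rank-1 update per k
        for i in range(n):
            aik = A[i][k]
            for j in range(n):
                C[i][j] += aik * B[k][j]   # basic operation (i, k, j)
    return C
```

Each (i, k, j) triple touched by the inner loops is one cell of the 3D computation cube; distributed schemes differ in how they partition this cube across processors.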
“…Since the appearance of CUDA programming, a large body of research has been carried out seeking better performance. This is the case for different computational cores, such as matrix multiplication [10], the Boltzmann equation [11], or the parallel 3D fast wavelet transform [12].…”
Section: State of the Art
confidence: 99%