Optimized Dense Matrix Multiplication on a Many-Core Architecture

Garcia, Elkin; Venetis, Ioannis E.; Khan, Rishi; Gao, Guang R.

doi:10.1007/978-3-642-15291-7_29

Cited by 18 publications

(20 citation statements)

References 15 publications

“…The limitation of memory bandwidth on many-core applications has been addressed to other levels in the memory hierarchy for some linear algebra applications [10,3]. They have proposed alternatives to find optimum tiling to the register level and mechanism for hiding memory latency.…”

Section: Introduction (mentioning)

confidence: 99%

Locality Optimization of Stencil Applications Using Data Dependency Graphs

Orozco, Garcia, Gao, 2011

Languages and Compilers for Parallel Computing

Self Cite

Abstract. This paper proposes tiling techniques based on data dependencies rather than on code structure. The work presented here leverages and expands previous work by the authors in the domain of non-traditional tiling for parallel applications. The main contributions of this paper are: (1) A formal description of tiling from the point of view of the data produced and not from the source code. (2) A mathematical proof for an optimum tiling in terms of maximum reuse for stencil applications, addressing the disparity between computation power and memory bandwidth for many-core architectures. (3) A description and implementation of our tiling technique for well-known stencil applications. (4) Experimental evidence that confirms the effectiveness of the proposed tiling in alleviating the disparity between computation power and memory bandwidth for many-core architectures. Our experiments, performed using one of the first Cyclops-64 many-core chips produced, confirm the effectiveness of our approach in reducing the total number of memory operations of stencil applications as well as the running time of the application.

“…There are several works optimizing MMM for many cores [53][54][55][56][57][58][59][60][61][62][63][64]. The fastest implementations are given in [23] where MMM is parallelized on Intel Xeon Phi and on IBM Blue Gene/Q; an analysis is made on which loop is going to be parallelized.…”

Section: Related Work (mentioning)

confidence: 99%

A high-performance matrix–matrix multiplication methodology for CPU and GPU architectures

Kelefouras, Kritikakou, Mporas, et al. 2016

J Supercomput

Current compilers cannot generate code that can compete with hand-tuned code in efficiency, even for a simple kernel like Matrix-Matrix Multiplication. A key step in program optimization is the estimation of optimal values for parameters such as tile sizes and number of levels of tiling. Selecting the scheduling parameter values is a very difficult and time-consuming task since parameter values depend on each other; this is why they are found by using searching methods and empirical techniques. To overcome this problem, the scheduling sub-problems must be optimized together, as one problem and not separately. In this paper a Matrix-Matrix Multiplication methodology is presented where the optimum scheduling parameters are found by decreasing the search space theoretically, while the major scheduling sub-problems are addressed together as one problem and not separately, according to the hardware architecture parameters and input size; for different hardware architecture parameters and/or input sizes, a different implementation is produced. This is achieved by fully exploiting the software characteristics (e.g., data reuse) and hardware architecture parameters (e.g., data cache sizes and associativities), giving high quality solutions and a smaller search space. This methodology applies to a wide range of CPU and GPU architectures.

“…In general, fine grain execution is useful only when the overhead associated with the execution is acceptable. In contrast, coarse-grained executions decrease the proportional overhead of task management at the cost of reducing parallelism and reducing the opportunities for load balancing in many-core systems [9].…”

Section: Motivation (mentioning)

confidence: 99%

Polytasks: A Compressed Task Representation for HPC Runtimes

Orozco, Garcia, Pavel, et al. 2013

Languages and Compilers for Parallel Computing

Self Cite

Abstract. The increased number of execution units in many-core processors is driving numerous paradigm changes in parallel systems. Previous techniques that focused solely upon obtaining correct results are being rendered obsolete unless they can also provide results efficiently. This paper dives into the particular problem of efficiently supporting fine-grained task creation and task termination for runtime systems in shared memory processors. Our contributions are inspired by our observation of High Performance Computing (HPC) programs, where it is common for a large number of similar fine-grained tasks to become enabled at the same time. We present evidence showing that task creation, assignment of tasks to processors, and task termination represent a significant overhead when executing fine-grained applications in many-core processors. We introduce the concept of the polytask, wherein the similarity of tasks created at the same time is exploited to allow faster task creation, assignment and termination. The polytask technique can be applied to any runtime system where tasks are managed through queues. The main contributions of this work are: (1) The observation that task management may generate substantial overhead in fine-grained parallel programs for many-core processors. (2) The introduction of the polytask concept: a data structure that can be added to queue-centric scheduling systems to represent groups of similar tasks. (3) Experimental evidence showing that the polytask is an effective way to implement fine-grained task creation/termination primitives for parallel runtime systems in many-core processors. We use microbenchmarks to show that queues modified to handle polytasks perform orders of magnitude faster than traditional queues in some scenarios. Furthermore, we use microbenchmarks to measure the amount of time spent executing tasks. We show situations where fine-grained programs using polytasks are able to achieve efficiencies close to 100%, while their efficiency drops to only 20% when not using polytasks. Finally, we use several applications with fine granularity to show that the use of polytasks results in average speedups from 1.4X to 100X depending on the queue implementation used.
