Proceedings of the 13th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming 2008
DOI: 10.1145/1345206.1345210
Automatic data movement and computation mapping for multi-level parallel architectures with explicitly managed memories

Abstract: Several parallel architectures, such as GPUs and the Cell processor, have fast explicitly managed on-chip memories in addition to slow off-chip memory. They also have very high computational power with multiple levels of parallelism. A significant challenge in programming these architectures is to effectively exploit the available parallelism and manage the fast memories to maximize performance. In this paper we develop an approach to effective automatic data management for on-chip memories, …

Cited by 79 publications (70 citation statements)
References 42 publications
“…This naive automatic scheme transfers the read set into the GPU when a kernel is invoked, and copies the write set to the CPU immediately after the kernel ends. It is easy to implement in a compiler and has been used in an initial version of Chapel [19] for the GPU and with minor variations in the OpenMP to GPU compiler [12].…”
Section: The Need for Efficient Memory Management
confidence: 99%
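The naive scheme quoted above can be sketched in a few lines. This is a hypothetical, simplified model (not code from the paper or from Chapel): host and device memories are plain dictionaries, and the `run_kernel_naive` helper name is an assumption for illustration. The point is that every launch transfers the full read set in and the full write set out, with no check for data that is already current on either side.

```python
# Hypothetical sketch of the "naive" automatic transfer scheme: every
# kernel launch copies its whole read set host -> device, runs the
# kernel, then copies its whole write set device -> host, regardless
# of whether either copy is already up to date.

def run_kernel_naive(kernel, read_set, write_set, host_mem, device_mem):
    # 1. Transfer every array the kernel reads to the device,
    #    even if the device copy is not stale.
    for name in read_set:
        device_mem[name] = list(host_mem[name])
    # 2. Execute the kernel against device memory.
    kernel(device_mem)
    # 3. Immediately transfer every array the kernel wrote back
    #    to the host, even if the host never reads it next.
    for name in write_set:
        host_mem[name] = list(device_mem[name])

# Usage: a toy kernel that doubles 'a' into 'b'.
host = {"a": [1, 2, 3], "b": [0, 0, 0]}
device = {}
run_kernel_naive(lambda m: m.update(b=[2 * x for x in m["a"]]),
                 read_set={"a"}, write_set={"b"},
                 host_mem=host, device_mem=device)
print(host["b"])  # → [2, 4, 6]
```

The redundant transfers in steps 1 and 3 are exactly what the runtime coherence mechanisms discussed by the citing paper aim to avoid.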
“…Baskaran et al develop an automatic polyhedral model-based framework [1] to insert data transfers statically. Their method is closely tied to their parallelization framework and does not involve any runtime coherence mechanism to avoid transfers of non-stale data.…”
Section: Related Work
confidence: 99%
“…Several papers [11], [12], [13], [14], [15], [17] discuss porting applications to GPUs and improving performance through an optimal assignment of architectural parameters to achieve overall execution efficiency. Our goal, in contrast, deals with a design methodology that improves performance of an application that consists of a composition of kernels.…”
Section: Related Work
confidence: 99%
“…Within the context of GPUs, another research direction involves optimization of shared memory use in GPUs, which are also a form of application-controlled cache. In this area, Baskaran et al have provided an approach for automatically arranging shared memory on NVIDIA GPU by using the polyhedral model for affine loops [4]. Moazeni et al have adapted approaches for register allocation, particularly those based on graph coloring, to manage shared memory on GPU [26].…”
Section: Related Work
confidence: 99%