Analysis and transformation in an interactive parallel programming tool

Kennedy, Ken; McKinley, Kathryn S.; Tseng, Chau‐Wen

doi:10.1002/cpe.4330050705

Cited by 20 publications

(9 citation statements)

References 36 publications

Supporting

Mentioning

Contrasting

Order By: Relevance

“…We implemented the analysis using the Parascope parallelizing environment [18]. We modified the Parascope analysis to work on explicitly parallel programs written for the release consistency model.…”

Section: Implementation and Limitationsmentioning

confidence: 99%

See 1 more Smart Citation

An integrated compile-time/run-time software distributed shared memory system

Dwarkadas

Cox

Zwaenepoel

1996

SIGOPS Oper. Syst. Rev.

View full text Add to dashboard Cite

On a distributed memory machine, hand-coded message passing leads to the most efficient execution, but it is difficult to use. Parallelizing compilers can approach the performance of hand-coded message passing by translating data-parallel programs into message passing programs, but efficient execution is limited to those programs for which precise analysis can be carried out. Shared memory is easier to program than message passing and its domain is not constrained by the limitations of parallelizing compilers, but it lags in performance. Our goal is to close that performance gap while retaining the benefits of shared memory. In other words, our goal is (1) to make shared memory as efficient as message passing, whether hand-coded or compiler-generated, (2) to retain its ease of programming, and (3) to retain the broader class of applications it supports.To this end we have designed and implemented an integrated compile-time and run-time software DSM system. The programming model remains identical to the original pure run-time DSM system. No user intervention is required to obtain the benefits of our system. The compiler computes data access patterns for the individual processors. It then performs a source-to-source transformation, inserting in the program calls to inform the run-time system of the computed data access patterns. The run-time system uses this information to aggregate communication, to aggregate data and synchronization into a single message, to eliminate consistency overhead, and to replace global synchronization with point-to-point synchronization wherever possible.We extended the Paxascope programming environment to perform the required analysis, and we augmented the TreadMaxks run-time DSM library to take advantage of the analysis. We used six Fortran programs to assess the performance benefits: Jacobi, 3D-FFT, Integer Sort, Shallow, Gauss, and Modified Gramm-Schmidt, each with two different data set sizes. The experiments were run on an 8-node IBM SP/2 using user-space communication. Compiler optimization in conjunction with the augmented run-time system achieves substantial execution time improvements in comparison to the base TreadMaxks, ranging from 4% to 59% on 8 processors. Relative to message passing implePermission to make digital/hard copy of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commemial advantage, the copyright notice, the title o1 the publication and its date appear, and notice is given that copying is by permission of ACM, Inc. To copy otherwise, to republish, to post on servers, or to redistribute to lists, requires prior specific permission and/or a fee.ASPLOS VII 10/96 MA, USA ¢) 1996 ACM 0-89791-767-7/96/0010...$3.50 mentations of the same applications, the compile-time runtime system is 0-29% slower than message passing, while the base run-time system is 5-212% slower. For the five programs that XHPF could parallelize (all except IS), the execution times achieved by t...

show abstract

Section: Implementation and Limitationsmentioning

confidence: 99%

“…We extended the Parascope parallel programming environment [18] to analyze and transform explicitly parallel programs. We also extended the interface to the TreadMarks run-time DSM system [2] to take advantage of the compiler analysis.…”

Section: Introductionmentioning

confidence: 99%

An integrated compile-time/run-time software distributed shared memory system

Dwarkadas

Cox

Zwaenepoel

1996

SIGOPS Oper. Syst. Rev.

View full text Add to dashboard Cite

show abstract

“…Directives included parallel loops, variable privatization, and critical sections. On the sequential version, we then used the advanced analysis and transformations available in our interactive parallel programming tool, the ParaScope Editor (PED) [6,14], to perform our parallel code generation algorithm. Although the individual transformations were automated, the code generation algorithm was not.…”

Section: Creating Program Versionsmentioning

confidence: 99%

Evaluating automatic parallelization for efficient execution on shared-memory multiprocessors

McKinley

1994

Proceedings of the 8th International Conference on Supercomputing - ICS '94

Self Cite

View full text Add to dashboard Cite

We present a parallel code generation algorithm for complete applications and a new experimental methodology that tests the efficacy of our approach. The algorithm optimizes for data locality and parallelism, reducing or eliminating false sharing. It also uses interprocedural analysis and transformations to improve the granularity of parallelism. Although the individual components of the algorithm have been published previously, their coordination is unique to this paper. For experimental validation, we do not attempt to parallelize 'dusty deck' programs where many have tried and failed. Instead, we collect programs where the users tried to achieve excellent parallel performance. We apply our optimizations to sequential versions of these programs, i.e., the compiler was required to use its analysis and algorithms to parallelize the program and could not rely on user assertions that for example, a loop is parallel. With this metric, our algorithm improves or matches hand-coded parallel programs on shared-memory, bus-based parallel machines for eight of the nine programs in our test suite.

show abstract

“…We do not recommend that users convert their programs to serial versions before handing it to the compiler, but we use this methodology to assess the ability of the compiler to find, exploit, and optimize known parallelism. We used ParaScope [12], [27], an interactive parallelization tool, to systematically apply the transformations in the algorithm to the sequential programs. ParaScope implements dependence analysis, interprocedural analysis, and safe application of the loop transformations (tiling, interchange, fusion, and distribution), but not the interprocedural optimizations, nor the optimization algorithm itself.…”

Section: Introductionmentioning

confidence: 99%

A compiler optimization algorithm for shared-memory multiprocessors

McKinley

1998

IEEE Trans. Parallel Distrib. Syst.

View full text Add to dashboard Cite

This paper presents a new compiler optimization algorithm that parallelizes applications for symmetric, shared-memory multiprocessors. The algorithm considers data locality, parallelism, and the granularity of parallelism. It uses dependence analysis and a simple cache model to drive its optimizations. It also optimizes across procedures by using interprocedural analysis and transformations. We validate the algorithm by hand-applying it to sequential versions of parallel, Fortran programs operating over dense matrices. The programs initially were hand-coded to target a variety of parallel machines using loop parallelism. We ignore the user's parallel loop directives, and use known and implemented dependence and interprocedural analysis to find parallelism. We then apply our new optimization algorithm to the resulting program. We compare the original parallel program to the hand-optimized program, and show that our algorithm improves three programs, matches four programs, and degrades one program in our test suite on a shared-memory, bus-based parallel machine with local caches. This experiment suggests existing dependence and interprocedural array analysis can automatically detect user parallelism, and demonstrates that user parallelized codes often benefit from our compiler optimizations, providing evidence that we need both parallel algorithms and compiler optimizations to effectively utilize parallel machines.

show abstract

Analysis and transformation in an interactive parallel programming tool

Cited by 20 publications

References 36 publications

An integrated compile-time/run-time software distributed shared memory system

An integrated compile-time/run-time software distributed shared memory system

Evaluating automatic parallelization for efficient execution on shared-memory multiprocessors

A compiler optimization algorithm for shared-memory multiprocessors

Contact Info

Product

Resources

About