Direct parallelization of call statements

Triolet, Rémi; Irigoin, François; Feautrier, Paul

doi:10.1145/12276.13329

Cited by 122 publications

(44 citation statements)

References 8 publications


“…After a suitable grain size is obtained, the compiler must determine any data dependences between pairs of nodes by array section analysis to generate the macro task graph. Where dependences exist, the compiler must further determine the reuse between the two nodes by applying array region analysis [9,10,11]. Thus a macro task graph G(N,E) consists of a set of nodes N = {n_1, n_2, ..., n_m} connected by a set of edges E, each of which is denoted by e_ij.…”

Section: Transformation Of OpenMP To Macro-Task Graph (mentioning)

confidence: 99%

Asynchronous Execution of OpenMP Code

Weng

Chapman

2003

Lecture Notes in Computer Science


Abstract. This paper presents the transformation of OpenMP source code to a Macro-Task Graph, an internal representation of the parallel program as a collection of tasks, which later can be asynchronously scheduled for out-of-order execution and optimized for locality reuse. The transformation is based on array region analysis. We also show the potential benefits of targeting OpenMP code to a macro-task graph, instead of directly generating a multi-threaded program. We show experimental results for a Jacobi kernel and part of the POP code in OpenMP and compiled traditionally versus macro-dataflow execution model using the SMARTS runtime system on SGI Origin 2000.


“…We found that only two benchmarks (btrix and phods) take advantage of this optimization. However, we believe that a more sophisticated (inter-procedural [19]) analysis can find more opportunities for this optimization in large codes that use many temporary arrays. Figure 15 summarizes the energy impact of our optimizations in each code.…”

Section: Evaluation Of Loop Splitting (mentioning)

confidence: 99%

Energy-oriented compiler optimizations for partitioned memory architectures

Delaluz

Kandemir

Narayanan

et al. 2000

Proceedings of the International Conference on Compilers, Architectures, and Synthesis for Embedded Systems - CASES '00


Due to low power requirements of many embedded/portable devices such as mobile phones and laptop computers and dramatic increases in clock frequencies of general-purpose processors, low-power software technology is becoming increasingly important in system design. Many applications from image and video processing as well as from dense linear algebra are array-dominated and data-intensive, thereby spending a major portion of their execution time and energy in the memory subsystem. This paper presents a compiler-based optimization framework that targets reducing the energy consumption in a partitioned off-chip memory architecture that contains multiple memory banks by organizing the order of computations and the layout of data. The optimizations considered in this work take advantage of low-power operating modes and the partitioned (multi-bank) structure of the off-chip memory. Our preliminary experiments show that the proposed framework improves memory energy by up to 86% over a scheme that keeps all the memory banks in the active (fully-operational) operating mode all the time, and up to 70% over a scheme that utilizes low-power operating modes without doing any loop and data optimizations.


“…Sections represent a restricted set of the most commonly occurring array access patterns: single elements, rows, columns, grids, and their higher dimensional analogs. The various approaches to interprocedural array side-effect analysis must make tradeoffs between precision and efficiency [3,4,10,16,23]. Section analysis loses precision because it only represents a selection of array structures and it merges sections for all references to a variable in a procedure into a single section.…”

Section: Interprocedural Analysis (mentioning)

confidence: 99%

Evaluating automatic parallelization for efficient execution on shared-memory multiprocessors

McKinley

1994

Proceedings of the 8th International Conference on Supercomputing - ICS '94


We present a parallel code generation algorithm for complete applications and a new experimental methodology that tests the efficacy of our approach. The algorithm optimizes for data locality and parallelism, reducing or eliminating false sharing. It also uses interprocedural analysis and transformations to improve the granularity of parallelism. Although the individual components of the algorithm have been published previously, their coordination is unique to this paper. For experimental validation, we do not attempt to parallelize 'dusty deck' programs where many have tried and failed. Instead, we collect programs where the users tried to achieve excellent parallel performance. We apply our optimizations to sequential versions of these programs, i.e., the compiler was required to use its analysis and algorithms to parallelize the program and could not rely on user assertions that for example, a loop is parallel. With this metric, our algorithm improves or matches hand-coded parallel programs on shared-memory, bus-based parallel machines for eight of the nine programs in our test suite.
