2003
DOI: 10.1007/3-540-35767-x_23
Coarse Grain Task Parallel Processing with Cache Optimization on Shared Memory Multiprocessor

Cited by 14 publications (4 citation statements)
References 13 publications
“…In the proposed cache optimization scheme, the task scheduler assigns macro-tasks inside a DLG to the same processor as consecutively as possible [18], in addition to the "critical path" priority used by both static and dynamic scheduling. Figure 5 shows a schedule when the proposed cache optimization is applied to the macro-task graph in Figure 4(b) for a single processor.…”
Section: Consecutive Execution Of Data Localizable Group
confidence: 99%
“…In the proposed cache optimization scheme, a task scheduler for the coarse grain tasks assigns macro-tasks inside a DLG to the same processor as consecutively as possible [14], in addition to the "critical path" priority. Fig. 3 shows a schedule when the proposed cache optimization is applied to the macro-task graph in Fig. 2.…”
Section: Loop Aligned Decomposition
confidence: 99%
“…This paper proposes a padding scheme that reduces conflict misses to improve the performance of coarse grain task parallel processing. In the cache optimization for coarse grain task parallel processing [14], the compiler first divides loops into smaller loops so that the data size accessed by each loop fits the cache size. Next, the compiler analyzes parallelism among the tasks containing the divided loops using Earliest Executable Condition analysis and schedules tasks that share the same data to the same processor, so that the tasks execute consecutively while accessing the shared data in the cache.…”
Section: Introduction
confidence: 99%
“…Coarse-grained granulation [12] occurs when the time spent executing a program's data-processing operations exceeds the total time spent initializing those operations and transferring the data they need. This type of granulation corresponds to a nested-loop structure in which the outermost loop of the nest is parallel.…”
Section: Introduction
confidence: 99%
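The nested-loop form of coarse granulation described above can be sketched briefly (the worker function and data are hypothetical): the outermost loop carries no dependences, so each of its iterations becomes one coarse task whose work amortizes the cost of launching it.

```python
from concurrent.futures import ThreadPoolExecutor

def process_row(row):
    # One coarse-grained task: the entire inner loop over a row.
    return sum(x * x for x in row)

data = [[1, 2], [3, 4], [5, 6]]

# Parallel outermost loop: one task per iteration of the outer loop.
with ThreadPoolExecutor() as pool:
    results = list(pool.map(process_row, data))

print(results)  # → [5, 25, 61]
```

Keeping the whole inner loop inside each task is what makes the grain coarse: task-startup and data-transfer overhead is paid once per row, not once per element.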