We study five popular auto-parallelization frameworks (Cetus, Par4all, Rose, ICC, and Pluto) and compare them qualitatively as well as quantitatively. All the frameworks primarily target loop parallelization but differ in the techniques they use to identify parallelization opportunities. Owing to this variation, certain features, such as specific loop transformations, are supported by only a few of the frameworks. The frameworks exhibit varying abilities to handle loop-carried dependences and therefore achieve different speedups on the widely used PolyBench and NAS parallel benchmarks. In particular, the Intel C Compiler (ICC) emerges as a good overall parallelizer. Our study also highlights the need for more sophisticated analyses, user-driven parallelization, and a meta-auto-parallelizer that combines the benefits of the various frameworks.

KEYWORDS

loop-carried dependence, loop transformations, privatization, vectorization
INTRODUCTION

The quest for performance has made multicore processors mainstream. It has become vital for programmers and algorithm developers to exploit these ubiquitous architectures through parallel programming. However, exploiting the potential of multicore processors in this way is a significant challenge. Among the several approaches to tackling this challenge, a promising and programmer-friendly one is automatic parallelization.1-3 Auto-parallelizers eliminate the need for a programmer to manually transform sequential code into parallel code, which is quite attractive.

Auto-parallelizers such as Cetus,4-7 Par4all,8,9 Pluto,10,11 Parallware,12,13 Rose,14,15 Intel C Compiler,16 LLVM (Low Level Virtual Machine) Polly,17 ParaWise,18 ParaGraph,19 SUIF (Stanford University Intermediate Format),20-22 and Polaris23,24 perform source-to-source transformations with insertion of parallel directives. Although these existing parallelizers offer considerable benefits, they still fall short of fully replacing manual transformation. Some parallelizers do not utilize all of the statically available information, whereas others lack modeling precision. This leads either to missed parallelization opportunities or to unwanted parallelization of sequential code. As a consequence, the auto-parallelized code may incur parallelization overheads and exhibit poorer performance than the original sequential version.25

Earlier studies26,27 of parallelizing compilers using the Perfect benchmarks performed a detailed analysis of code restructuring techniques. The techniques include induction variable elimination, scalar expansion, forward substitution, strip mining, and loop interchange. The studies found that some of the programs showed improvements and that scalar expansion led to positive results. An effectiveness study of the Polaris compiler with OpenMP (Open Multi-Processing)28 parallel code using the Perfect benchmarks reported a performance lag in small parallel loops. It also illustrated the importance of the reduction operation, which resulted in a moderate (10%) performance...