Programmers today face a bewildering array of parallel programming models and tools, making it difficult to choose an appropriate one for each application. An increasingly popular programming model that supports structured parallel programming patterns in a portable and composable manner is the task-centric programming model. In this study, we compare several popular task-centric programming frameworks, including Cilk Plus, Threading Building Blocks, and various implementations of OpenMP 3.0. We have analyzed their performance on the Barcelona OpenMP Tasking Suite benchmarks on both a 48-core AMD Opteron 6172 server and a 64-core TILEPro64 embedded many-core processor. Our results show that OpenMP offers the highest flexibility for programmers, but this flexibility comes at a cost. Frameworks supporting only a specific and more restrictive model, such as Cilk Plus and Threading Building Blocks, are generally more efficient in terms of both performance and energy consumption. However, Intel's implementation of OpenMP tasks performs best among the OpenMP implementations and comes closest to the specialized run-time systems.

Mercurium is the source-to-source compiler used in conjunction with Nanos++.
† Because of the (not entirely unexpected) lack of a TILEPro64 back-end in the Intel compiler.
‡ This means that to synchronize with N tasks, the programmer needs to explicitly use SYNC N times.

PERFORMANCE IN TASK-CENTRIC PROGRAMMING FRAMEWORKS 13

Figure 5. Memory footprint for each run-time system implementation, normalized against a serial execution. The baseline is the serial execution compiled with GCC.

… example, Fibonacci and N-queens. Intel's TBB is the implementation with the largest memory footprint of all models.

4.4.6. Embedded power measurements. Power consumption and energy have become metrics as important as performance, particularly on embedded devices. We have measured the power and energy consumed by the application under different run-time systems on the TILEPro64.
We did not perform these measurements on the Opteron system because we found it much more difficult to isolate the contribution of the processors and memory on that machine, whereas doing so was relatively straightforward on the TILEPro64. We used a data acquisition device (NI USB-6210) to perform the power measurements on the TILEPro64. Sampling the power consumption did not interfere with program execution in any way, as it was performed on a separate computer connected to the data acquisition device. This set-up is similar to the ones used by Själander et al. [44]. The measured sampling frequency was 20 kHz. We used a metric that we call speed-up power cost, which calculates the speed-up an application experiences for each added watt.
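As an illustration, the following Python sketch shows one way a power trace sampled at 20 kHz could be reduced to an energy figure and to a speed-up power cost. The function names, the synthetic traces, and in particular the formula (speed-up divided by the additional average power drawn relative to the serial run) are assumptions made for this example; the text above defines the metric only informally, so this should not be read as the exact computation used in the study.

```python
SAMPLE_RATE_HZ = 20_000  # 20 kHz sampling frequency, as stated in the text


def duration_s(power_samples_w, sample_rate_hz=SAMPLE_RATE_HZ):
    """Run time covered by a trace: number of samples divided by the rate."""
    return len(power_samples_w) / sample_rate_hz


def mean_power_w(power_samples_w):
    """Average power (watts) over the trace."""
    return sum(power_samples_w) / len(power_samples_w)


def energy_j(power_samples_w, sample_rate_hz=SAMPLE_RATE_HZ):
    """Energy (joules) as a rectangle-rule integral of the power samples."""
    return sum(power_samples_w) / sample_rate_hz


def speedup_power_cost(serial_trace_w, parallel_trace_w,
                       sample_rate_hz=SAMPLE_RATE_HZ):
    """ASSUMED definition: speed-up per watt added over the serial run."""
    speedup = (duration_s(serial_trace_w, sample_rate_hz)
               / duration_s(parallel_trace_w, sample_rate_hz))
    added_watts = mean_power_w(parallel_trace_w) - mean_power_w(serial_trace_w)
    if added_watts <= 0:
        raise ValueError("parallel run does not draw more power than serial run")
    return speedup / added_watts


# Synthetic, made-up traces (not measured data): a 2 s serial run at a
# constant 10 W and a 0.5 s parallel run at a constant 14 W.
serial_trace = [10.0] * 40_000
parallel_trace = [14.0] * 10_000

print("serial energy (J):  ", energy_j(serial_trace))
print("parallel energy (J):", energy_j(parallel_trace))
print("speed-up per added watt:", speedup_power_cost(serial_trace, parallel_trace))
```

Under this assumed definition, a higher value means that each additional watt buys more speed-up, which makes it possible to compare run-time systems that reach similar speed-ups at different power levels.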
EXPERIMENTAL RESULTS AND DISCUSSION
This section presents the experimental results obtained from micro-benchmarks and other benchmarks according to the methodology from the previous section.
Micro-benchmarks
All the micro-benchmark measurements were performed on the 48-core Opteron 6172 system.
A. PODOBAS, M. BRORSSON AND K.-F. FA...