Many-core hardware is targeted specifically at obtaining high performance, but reaching high performance is often challenging because hardware-specific details have to be taken into account. Although many programming systems try to ease many-core programming, some providing a high-level language and others a low-level language for control, none of these systems has a clear and systematic methodology as a foundation. In this article, we propose stepwise-refinement for performance: a novel, clear, and structured methodology for obtaining high performance on many-cores. We present a system that supports this methodology, offers multiple levels of abstraction to give programmers a trade-off between high-level and low-level programming, and provides programmers with detailed performance feedback. We evaluate our methodology with several widely varying compute kernels on two different many-core architectures: a Graphics Processing Unit (GPU) and the Xeon Phi. We show that our methodology gives insight into performance and that, in almost all cases, we gain a substantial performance improvement using our methodology.

Section 2 elaborates on how various many-core programming approaches relate to MCL. In Section 3, we introduce our methodology, stepwise-refinement for performance. Section 4 gives an overview of MCL and of how our system implements our methodology. In Section 5, we give a detailed example of how the process of stepwise-refinement for performance takes place. Section 6 discusses several of the implementation techniques of our system. Section 7 evaluates our techniques for various well-known compute kernels. We conclude the article with a discussion and conclusion.
RELATED WORK

The challenges in many-core programming are widely recognized, and many approaches try to alleviate them. This section discusses the current status of many-core programming and identifies issues (summarized in Table I) that we try to address in our work. We distinguish three programming approaches: high-level programming, separation of concerns, and a tuning cycle approach. Section 2.2 discusses systems that influenced MCL.

2.1.3. Tuning cycle approach. The tuning cycle approach is an iterative process that usually consists of the following steps: evaluate the performance of an application, analyze the gathered results, and refactor the code to increase performance. This approach usually fits low-level languages such as CUDA [28] or OpenCL [29], which offer programmers a high degree of control over the code. However, it can also be applied to directive-based programming systems, where more detailed directives are inserted in each step [30][31][32][33][34][35].

Figure 15. Part of the hardware description mic.
Xeon Phi

Mic. Intel's Many Integrated Core (MIC) architecture contains several tens of in-order x86 cores with powerful vector units and several hardware threads connected through a ring network. The MIC exposes two layers of parallelism: vector in...