Time and Energy Performance of Parallel Systems with Hierarchical Memory

Accelerating the bi-objective optimization of applications for performance and energy is crucial to achieving energy efficiency objectives and meeting quality-of-service requirements in modern high-performance computing platforms and cloud computing infrastructures. In this work, we highlight the crucial challenges in accelerating model-based methods proposed for the bi-objective optimization of data-parallel applications for performance and energy that employ workload distribution between the executing processors as the decision variable. The methods solve unconstrained bi-objective optimization problems. They take as input the processors' performance and energy profiles in the form of discrete functions of workload size, and output Pareto-optimal solutions (workload distributions) that minimize the execution time and the total energy consumption of computations during the parallel execution of the application. One of the challenges is the fast computation of Pareto-optimal solutions. We then formulate the bi-objective optimization problem of data-parallel applications for performance and energy through workload distribution on a cluster of p identical hybrid nodes, each containing h heterogeneous processors. The state-of-the-art algorithm for solving the problem is sequential and takes exorbitant execution times to find Pareto-optimal solutions for even moderate numbers of processors. We propose two algorithms that address this shortcoming. The first algorithm is an exact sequential algorithm that is more efficient and amenable to parallelization and achieves a complexity reduction of O(m × h) over the state-of-the-art sequential algorithm, where m is the cardinality of the input discrete execution time and dynamic energy functions. The second algorithm is a parallel algorithm executed by q identical parallel processes that reduces the complexity of our proposed sequential algorithm by O(q) and therefore achieves a complexity reduction of O(m × h × q) over the state-of-the-art sequential algorithm. Finally, we experimentally analyze the practical efficacy of our proposed algorithms for two data-parallel applications, matrix multiplication and fast Fourier transform, on a heterogeneous hybrid node containing an Intel Haswell multicore CPU, an Nvidia K40c GPU, and an Nvidia P100 GPU, and on simulations of clusters of such hybrid nodes. The experiments demonstrate that our proposed algorithms provide tremendous speedups over state-of-the-art solutions.

INDEX TERMS: high-performance heterogeneous computing, energy-efficient computing, bi-objective optimization, performance optimization, energy optimization, data-parallel applications, workload distribution

I. INTRODUCTION

Performance and energy are the two most important objectives for optimization in modern high performance computing (HPC) platforms, computational grids, data centers, and cloud computing infrastructures ([1], [2], [3], [4]). Achieving the energy efficiency objectives

“…Research works [28], [29], [30] are analytical studies of bi-objective optimization for performance and energy.…”

Section: A. System-level Methods (mentioning)

confidence: 99%

Acceleration of Bi-Objective Optimization of Data-Parallel Applications for Performance and Energy on Heterogeneous Hybrid Platforms

2023

“…They use iso-energy maps to study performance-energy trade-offs. Marszalkowski et al. [69] analyzed the impact of memory hierarchies on time-energy trade-off in parallel computations, which are represented as divisible loads. They represent execution time and energy by two linear functions on problem size, one for in-core computations and the other for out-of-core computations.…”

Section: Application-level Methods (mentioning)

confidence: 99%
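The two-regime linear model mentioned in the citation statement above can be illustrated with a small sketch. It assumes, as one possible reading, that a load is charged by the in-core line while it fits in core memory and by the out-of-core line otherwise. The class `LinearCost`, the function names, the capacity, and every coefficient are hypothetical values chosen for illustration and are not taken from the cited paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LinearCost:
    """Linear cost in problem size: startup + rate * size (hypothetical values)."""
    startup: float
    rate: float

    def __call__(self, size: float) -> float:
        return self.startup + self.rate * size


def two_regime_cost(size, core_capacity, in_core, out_of_core):
    """Assumption: use the in-core line while the load fits, else the out-of-core line."""
    model = in_core if size <= core_capacity else out_of_core
    return model(size)


# Hypothetical parameters, for illustration only.
CAPACITY = 100.0                     # largest load that fits in core memory
TIME_IN = LinearCost(0.0, 1.0)       # seconds per load unit, in-core
TIME_OUT = LinearCost(0.0, 4.0)      # seconds per load unit, out-of-core
ENERGY_IN = LinearCost(0.0, 10.0)    # joules per load unit, in-core
ENERGY_OUT = LinearCost(0.0, 25.0)   # joules per load unit, out-of-core


def time_and_energy(size):
    """Return (execution time, energy) for a load of the given size."""
    return (
        two_regime_cost(size, CAPACITY, TIME_IN, TIME_OUT),
        two_regime_cost(size, CAPACITY, ENERGY_IN, ENERGY_OUT),
    )


if __name__ == "__main__":
    for load in (50.0, 200.0):
        print(load, time_and_energy(load))
```

With these made-up coefficients, a load that spills out of core memory pays a higher per-unit cost in both time and energy, which is the kind of effect a memory-hierarchy study would examine.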

Design of self‐adaptable data parallel applications on multicore clusters automatically optimized for performance and energy through load distribution

Manumachu, Lastovetsky. 2018. Concurrency and Computation

Self-adaptability is a highly preferred feature in HPC applications. A crucial building block of a self-adaptable application is a data partitioning algorithm that must possess several essential qualities apart from low runtime and memory costs. On modern platforms composed of multicore CPUs, data partitioning algorithms striving to solve the bi-objective optimization problem for performance and energy (BOPPE) face a formidable challenge. They must take into account the new complexities inherent in these platforms, such as severe resource contention and non-uniform memory access (NUMA). Novel model-based methods and data partitioning algorithms have been proposed that address the challenge. However, these methods take as input full functional performance and energy models (FPM and FEM), which have prohibitively high model construction costs. Therefore, they are not suitable for employment in self-adaptable applications. In this paper, we present a self-adaptable data partitioning algorithm called ADAPTALEPH, which solves BOPPE on homogeneous clusters of multicore CPUs. Unlike state-of-the-art methods for solving BOPPE, which take full FPM and FEM as inputs, it constructs partial FPM and FEM during its execution using all the available processors. It returns a locally Pareto-optimal set of solutions, which are the heterogeneous workload distributions that achieve inter-node optimization of data-parallel applications for performance and energy. We experimentally study the efficiency of ADAPTALEPH for three data-parallel applications, i.e., matrix-vector multiplication, matrix-matrix multiplication, and fast Fourier transform, on a modern multicore CPU and on simulations of homogeneous clusters of such CPUs. We demonstrate that the locally Pareto-optimal front approaches the globally Pareto-optimal front as the number of points in the partial discrete FPM and FEM functions is increased. The number of points in the partial FPM/FEM at which the locally Pareto-optimal front becomes the globally Pareto-optimal front is considerably smaller than the number of points in the full FPM/FEM, which suggests developing methods that leverage this finding to drastically reduce model construction times.

“…[44], [10], [45], [46] are analytical studies of bi-objective optimization for performance and energy. Choi et al [44] extend the energy roofline model by adding an extra parameter, power cap, to their execution time model.…”

Section: Notable Work Involving Performance and Energy (mentioning)

confidence: 99%

“…Drozdowski et al [45] use iso-energy map, which are points of equal energy consumption in a multi-dimensional space of system and application parameters, to study performance-energy trade-offs. Marszalkowski et al [46] analyze the impact of memory hierarchies on time-energy trade-off in parallel computations, which are represented as divisible loads.…”

Section: Notable Work Involving Performance and Energy (mentioning)

confidence: 99%

Bi-objective Optimisation of Data-parallel Applications on Heterogeneous Platforms for Performance and Energy via Workload Distribution

Khaleghzadeh, Fahad, Shahid et al. 2019. Preprint

Performance and energy are the two most important objectives for optimization on modern parallel platforms. Recent research demonstrated the importance of workload distribution as a key decision variable in the bi-objective optimization of data-parallel applications for performance and energy on homogeneous multicore CPU clusters. We show in this work that moving from single-objective optimization for performance or energy to their bi-objective optimization on heterogeneous processors results in a tremendous increase in the number of optimal solutions (workload distributions), even for the simple case of linear performance and energy profiles. We then study full performance and energy profiles of two real-life data-parallel applications and find that they exhibit shapes that are non-linear and complex enough to prevent good approximation of them as analytical functions for input to exact algorithms or optimization software for determining the globally Pareto-optimal front. We therefore propose a solution method that solves the bi-objective optimization problem on heterogeneous processors and comprises two principal components. The first component is an efficient and exact global optimization algorithm. The algorithm takes as input the most general discrete performance and dynamic energy functions, which accurately and realistically account for resource contention and NUMA inherent in modern parallel platforms. The algorithm is also used as a building block to solve the bi-objective optimization problem for performance and total energy. The second component is a novel methodology employed to build the discrete dynamic energy profiles of individual computing devices, which are input to the algorithm. The methodology is based purely on system-level measurements and addresses a fundamental challenge, which is to accurately model the energy consumption of a hybrid scientific data-parallel application executing on a heterogeneous HPC platform containing different computing devices such as CPU, GPU, and Xeon Phi. We experimentally analyse the proposed solution method using two data-parallel applications, matrix multiplication and 2D fast Fourier transform (2D-FFT), and show that our solution method determines a superior Pareto-optimal front containing the best load-balanced solutions and all the load-imbalanced solutions that are totally ignored by load balancing methods.
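To make the notion of a Pareto-optimal front of workload distributions concrete, here is a minimal brute-force sketch. It is not the algorithm proposed in the paper above: it enumerates every distribution, which is only feasible for tiny inputs. It assumes that execution time is the maximum of the per-processor times and that dynamic energy is the sum of the per-processor energies. The function names and the linear profiles are hypothetical.

```python
def distributions(n, p):
    """Yield every tuple of p non-negative integers that sums to n."""
    if p == 1:
        yield (n,)
        return
    for x in range(n + 1):
        for rest in distributions(n - x, p - 1):
            yield (x,) + rest


def evaluate(dist, time_profiles, energy_profiles):
    """Assumed objectives: time = slowest processor, energy = sum over processors."""
    t = max(tp[x] for tp, x in zip(time_profiles, dist))
    e = sum(ep[x] for ep, x in zip(energy_profiles, dist))
    return t, e


def pareto_front(n, time_profiles, energy_profiles):
    """Return non-dominated (time, energy, distribution) triples, sorted by time.

    Only one distribution is kept for each distinct (time, energy) point.
    """
    points = sorted(
        (evaluate(d, time_profiles, energy_profiles), d)
        for d in distributions(n, len(time_profiles))
    )
    front, best_energy = [], float("inf")
    for (t, e), d in points:
        # After sorting by time, a point is non-dominated only if it
        # strictly lowers the best energy seen so far.
        if e < best_energy:
            front.append((t, e, d))
            best_energy = e
    return front


# Hypothetical discrete profiles indexed by workload size 0..4.
TIME = [[0, 1, 2, 3, 4],      # fast processor
        [0, 2, 4, 6, 8]]      # slow processor
ENERGY = [[0, 5, 10, 15, 20],  # fast but power-hungry
          [0, 2, 4, 6, 8]]     # slow but frugal

if __name__ == "__main__":
    for t, e, d in pareto_front(4, TIME, ENERGY):
        print(f"time={t} energy={e} distribution={d}")
```

Even with these two linear profiles and a workload of only four units, four of the five possible distributions lie on the front, which illustrates how quickly the number of optimal trade-offs can grow.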

Time and Energy Performance of Parallel Systems with Hierarchical Memory

Cited by 18 publications

References 38 publications

Acceleration of Bi-Objective Optimization of Data-Parallel Applications for Performance and Energy on Heterogeneous Hybrid Platforms

Design of self‐adaptable data parallel applications on multicore clusters automatically optimized for performance and energy through load distribution

Bi-objective Optimisation of Data-parallel Applications on Heterogeneous Platforms for Performance and Energy via Workload Distribution
