2017
DOI: 10.1016/j.simpat.2017.05.009
MERPSYS: An environment for simulation of parallel application execution on large scale HPC systems

Cited by 29 publications (24 citation statements); references 10 publications.
“…(8) Problem of finding the best hardware configuration for a given problem and its implementation (CPU/GPU/other accelerators/hybrid), considering the relative performance of CPUs, GPUs, interconnects, etc. Certain environments such as MERPSYS [76] allow simulation of parallel application execution on various hardware, including compute devices such as CPUs and GPUs, but the process requires prior calibration on small systems and target applications. (9) Lack of standardized APIs for new technologies such as NVRAM in parallel computing.…”
Section: Challenges In Modern High-performance Computingmentioning
confidence: 99%
“…algorithm: ring, Rabenseifner, pre-reduced ring (PRR) and sorted linear tree (SLT); size (of data vector): 128 K, 512 K, 1 M, 2 M, 4 M, 8 M floats (4 bytes each); mode (of process delay): one-late (only one process is delayed by maxDelay) and rand-late (all processes are delayed randomly, up to maxDelay); maxDelay (of process arrival times): 0, 1, 5, 10, 50, 100, 500, 1000 ms; P (number of processes/nodes): 4, 6, 8, 10, 12, 16, 20, 24, 28, 32, 36, 40, 44, 48; N (number of iterations): 64-256, depending on maxDelay (more for lower delay). Table 3 presents the results of the benchmark execution for 1 M floats of reduced data over a 1 Gbps Ethernet network, where only one process was delayed, on 48 nodes of the Tryton [14] HPC cluster. The results are presented as absolute values of average elapsed time ē_alg and speedup s_alg relative to the ring algorithm, s_alg = ē_ring / ē_alg, where alg is the evaluated algorithm.…”
Section: Environment and Test Setupmentioning
confidence: 99%
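The speedup metric quoted above (s_alg = ē_ring / ē_alg) can be sketched in a few lines. The timings and the `speedup` helper below are purely illustrative assumptions, not data from the cited benchmark:

```python
# Sketch: speedup of each allreduce algorithm relative to the ring
# baseline, s_alg = e_ring / e_alg, using average elapsed times.
# The millisecond values here are hypothetical placeholders.
avg_elapsed_ms = {
    "ring": 120.0,          # baseline algorithm
    "rabenseifner": 95.0,
    "prr": 88.0,            # pre-reduced ring
    "slt": 105.0,           # sorted linear tree
}

def speedup(times, baseline="ring"):
    """Return {algorithm: baseline_time / algorithm_time}."""
    e_base = times[baseline]
    return {alg: e_base / e for alg, e in times.items()}

s = speedup(avg_elapsed_ms)
```

By construction the baseline always has speedup 1.0, and any algorithm faster than ring scores above 1.0.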
“…
• analysis of the trade-off to find potential points where values of measures incorporating execution time and energy used would be optimal for a specific application,
• benchmarking other applications, especially those that draw more power from our testbed systems,
• power-aware modeling of compute devices in frameworks for simulation of application runs in high performance computing environments such as MERPSYS [23],
• development of a tool for automatic detection of the optimal power settings for the aforementioned time-energy measures using historical data (e.g. via machine learning),
• proposing a new method for dynamically minimizing electrical energy usage at runtime for various HPC/cloud workloads [24].…”
Section: Final Remarks and Future Workmentioning
confidence: 99%