Balancing HPC applications through smart allocation of resources in MT processors

Boneti, Carlos; Gioiosa, Roberto; Cazorla, Francisco J.; Corbalan, Julita; Labarta, Jesús; Valero, Mateo

doi:10.1109/ipdps.2008.4536293

Cited by 18 publications

(36 citation statements)

References 22 publications

“…STM² and all the applications are compiled with GCC 4.3.4 with optimization level O3; the results reported for each application are the average of 25 runs. In order to use all hardware priority levels, all tests are performed on a custom version of the Linux 2.6.33 kernel patched with the HMT patch [3]. In all the experiments, Eigenbench and STAMP applications use all the 32 available hardware threads: 16 application threads and 16 auxiliary threads.…”

Section: Results (mentioning)

confidence: 99%

“…As Table 1 shows, not all hardware thread priority values can be set by applications: user software can only set priority levels 2, 3, 4; the operating system (OS) can set 6 out of 8 levels, from 1 to 6; the Hypervisor can span the whole range of priorities. In order to use all possible levels of priorities, a special Linux 2.6.33 kernel patched with the Hardware Managed Threads priority (HMT) patch [2,3,4] is required. This custom kernel provides two interfaces (a sysfs and a system call) through which the users can set the current hardware thread priority, including the ones that require OS or Hypervisor privilege (the OS issues a special Hypervisor call to set priority 0 and 7).…”

Section: Hardware Resource Partitioning (mentioning)

confidence: 99%
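The privilege rules in the statement above can be sketched as a small lookup. This is only an illustration of the ranges quoted there (user: 2 to 4, OS: 1 to 6, Hypervisor: 0 to 7); the names `Privilege`, `SETTABLE`, and `min_privilege` are hypothetical and are not part of the HMT patch or any kernel interface.

```python
from enum import IntEnum


class Privilege(IntEnum):
    """Hypothetical labels for the three privilege levels named in the statement."""
    USER = 0
    OS = 1
    HYPERVISOR = 2


# Priority values each level may set, taken from the quoted statement.
SETTABLE = {
    Privilege.USER: range(2, 5),        # levels 2, 3, 4
    Privilege.OS: range(1, 7),          # levels 1 through 6
    Privilege.HYPERVISOR: range(0, 8),  # the whole range, 0 through 7
}


def min_privilege(priority: int) -> Privilege:
    """Return the least privileged level that may set the given priority."""
    if priority not in range(0, 8):
        raise ValueError(f"hardware thread priority must be 0..7, got {priority}")
    # IntEnum members iterate in definition order: USER, OS, HYPERVISOR.
    for level in Privilege:
        if priority in SETTABLE[level]:
            return level
    raise AssertionError("unreachable: the Hypervisor range covers 0..7")
```

Under these assumptions, `min_privilege(3)` yields `Privilege.USER`, while priorities 0 and 7 require `Privilege.HYPERVISOR`, which matches the statement's remark that the OS must issue a Hypervisor call for those two levels.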

“…Other researchers [23] have also investigated the effect of hardware thread priorities on the execution time of co-scheduled application pairs on a trace-driven simulator of the POWER5 processor. Moreover, in a follow-up work, Boneti et al used hardware prioritization to transparently balance high performance computing applications [3,4], achieving up to 18% performance improvement.…”

Section: Related Work (mentioning)

confidence: 99%

Enhancing the performance of assisted execution runtime systems through hardware/software techniques

Kestor, Gioiosa, Unsal, et al. 2012

Proceedings of the 26th ACM International Conference on Supercomputing

Self Cite

To meet the expected performance, future exascale systems will require programmers to increase the level of parallelism of their applications. Novel programming models simplify parallel programming at the cost of increasing runtime overhead. Assisted execution models have the potential to reduce this overhead, but they generally also reduce processor utilization. We propose an integrated hardware/software solution that automatically partitions hardware resources between application and auxiliary threads. Each system level performs well-defined tasks efficiently: 1) the runtime system is enriched with a mechanism that automatically detects computing power requirements of running threads and drives the hardware actuators; 2) the hardware enforces dynamic resource partitioning; 3) the operating system provides an efficient interface between the runtime system and the hardware resource allocation mechanism. As a test case, we apply this adaptive approach to STM², a software transactional memory system that implements the assisted execution model. We evaluate the proposed adaptive solution on an IBM POWER7 system using Eigenbench and the STAMP benchmark suite. Results show that our approach performs as well as or better than the original STM² and achieves up to 65% and 86% performance improvement for Eigenbench and STAMP applications, respectively.

“…According to the classification in [3], there are two main classes of load imbalance due to causes that become apparent only during the application's execution: (i) Intrinsic load imbalance is caused by characteristics, which are intrinsic to the application, such as the input data. For example, sparse matrix computations heavily depend on the number of non-zero values in the matrix; the convergence time of iterative methods that approximate the solution of a problem may change for different domains of the modeled space.…”

Section: Introduction (mentioning)

confidence: 99%
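The sparse-matrix example of intrinsic load imbalance in the statement above can be made concrete with a small sketch. It assumes rows are split into equal-sized contiguous blocks and that work is proportional to the number of non-zero values; the helper names and the imbalance metric (1 minus mean work over maximum work) are illustrative assumptions, not definitions taken from the cited papers.

```python
def nonzeros_per_thread(row_nnz, num_threads):
    """Split rows into contiguous, equal-sized blocks and sum the non-zeros in each.

    row_nnz[i] is the number of non-zero values in row i (assumed proportional
    to the work needed for that row).
    """
    if num_threads <= 0:
        raise ValueError("num_threads must be positive")
    block = -(-len(row_nnz) // num_threads)  # ceiling division
    return [sum(row_nnz[t * block:(t + 1) * block]) for t in range(num_threads)]


def imbalance(work):
    """Illustrative metric: fraction of capacity idle while waiting for the slowest thread."""
    peak = max(work)
    if peak == 0:
        return 0.0
    return 1.0 - (sum(work) / len(work)) / peak


# Every thread gets the same number of rows, but not the same amount of work.
row_nnz = [8, 8, 8, 8, 1, 1, 1, 1]
work = nonzeros_per_thread(row_nnz, num_threads=2)
print(work, imbalance(work))
```

Both threads receive four rows here, yet the first one handles eight times as many non-zero values, which is the kind of input-dependent imbalance that a static, row-count-based partitioning cannot anticipate.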

“…For example, the OS may decide to run another process (say a kernel daemon) in place of the process running on a CPU. Also, extrinsic load imbalance may be caused by thread contention for processor's shared resources; this may be particularly true in the case of SMT architectures, where threads share and compete for most of the processor's resources [3]. Clearly, there is nothing that the application programmer could do a priori to prevent extrinsic load imbalance.…”

Section: Introduction (mentioning)

confidence: 99%

Load balancing using dynamic cache allocation

Moretó, Cazorla, Sakellariou, et al. 2010

Proceedings of the 7th ACM International Conference on Computing Frontiers

Self Cite

Supercomputers need a huge budget to be built and maintained. To maximize the usage of their resources, application developers spend time optimizing the code of parallel applications to minimize execution time. Despite this effort, load imbalance still arises in many optimized applications due to causes not controlled by the application developer, resulting in significant performance degradation and wasted CPU time. If the nodes of the supercomputer use chip multiprocessors, this problem may become even worse, as the interaction between different threads inside the chip may affect their performance in an unpredictable way. Although there are many techniques to address load imbalance at run-time, as it happens, these techniques may not be particularly effective when the imbalance is caused by the performance sensitivity of the parallel threads when accessing a shared cache. To this end, we present a novel run-time mechanism, with minimal hardware, that automatically tries to balance parallel applications using dynamic cache allocation. The mechanism detects which applications may be sensitive to cache allocation and reduces imbalance by assigning more cache space to the slowest threads. The efficiency of our proposed mechanism is demonstrated with both synthetic workloads and a real-world parallel application. In the former case, we reduce the execution time by up to 28.9%; in the latter case, our proposal reduces the imbalance of a non-optimized version of the application to the values obtained with a hand-tuned version of the same application.
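The idea of "assigning more cache space to the slowest threads" can be illustrated with a toy policy. The sketch below is not the mechanism from the paper: it assumes the shared cache is divided into ways, that per-thread execution times for the last interval are known, and that one way moves per step. The function name `rebalance_ways` and the `min_ways` parameter are hypothetical, and the sensitivity-detection step mentioned in the abstract is left out.

```python
def rebalance_ways(ways, exec_time, min_ways=1):
    """Move one cache way from the fastest thread to the slowest one.

    ways[i]      -- cache ways currently assigned to thread i (assumed model)
    exec_time[i] -- time thread i took in the last interval
    min_ways     -- hypothetical floor so no thread loses all of its cache
    Returns a new allocation; the total number of ways is preserved.
    """
    if len(ways) != len(exec_time) or not ways:
        raise ValueError("ways and exec_time must be non-empty and the same length")
    slowest = max(range(len(ways)), key=lambda i: exec_time[i])
    fastest = min(range(len(ways)), key=lambda i: exec_time[i])
    new_ways = list(ways)
    # Only transfer when there are two distinct threads and the donor can spare a way.
    if slowest != fastest and new_ways[fastest] > min_ways:
        new_ways[fastest] -= 1
        new_ways[slowest] += 1
    return new_ways
```

Calling `rebalance_ways([4, 4, 4, 4], [10, 12, 9, 15])` takes one way from the third thread (the fastest) and gives it to the fourth (the slowest). A real mechanism would also need to check whether the slow thread actually benefits from extra cache, which this sketch does not attempt.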

Stragglers in Distributed Matrix Multiplication

Nissim, Schwartz 2023

Lecture Notes in Computer Science
