Load Balancing for Regular Meshes on SMPs with MPI

Kale, Vivek; Gropp, William

doi:10.1007/978-3-642-15646-5_24

Cited by 12 publications

(19 citation statements)

References 5 publications


“…The above characterization confirms the benefits of the solution proposed in [6]. When a platform has several different noise events of different lengths, a dynamic scheduling strategy with an assortment of task granularities can be used.…”

Section: Architectures Considered (supporting)

confidence: 73%

“…Our initial work shows that the performance improvement of an application is relatively unnoticeable when running on a small number of nodes of a cluster, but becomes much more dramatic as we scale the application to a large number of nodes [6].…”

Section: Introduction (mentioning)

confidence: 99%

“…However, a preceding study that we did shows that through careful tuning to keep the cost of dynamic scheduling low while trying to minimize small-scale load imbalances within a node, we can achieve more predictable performance within a multi-core processor; this ultimately allows for better scalability as we run on a larger number of nodes [6]. The amount of dynamic scheduling we allow is proportional to the duration of the characteristic system noise of a machine.…”

Section: Introduction (mentioning)

confidence: 99%


Weighted locality-sensitive scheduling for mitigating noise on multi-core clusters

Kale

Bhatelé

Gropp

2011

2011 18th International Conference on High Performance Computing

Self Cite


Recent studies have shown that operating system (OS) interference, popularly called OS noise, can be a significant problem as we scale to a large number of processors. One solution for mitigating noise is to turn off certain OS services on the machine. However, this is typically infeasible because full-scale OS services may be required for some applications. Furthermore, it is not a choice that an end user can make. Thus, we need an application-level solution. Building upon previous work that demonstrated the utility of within-node light-weight load balancing, we discuss the technique of weighted micro-scheduling and provide insights based on experimentation for two different machines with very different noise signatures. Through careful enumeration of the search space of scheduler parameters, we allow our weighted micro-scheduler to be dynamic, adaptive and tunable for a specific application running on a specific architecture. By doing this, we show how we can enable running scientific applications efficiently on a very large number of processors, even in the presence of noise.


“…V. Kale et al suggested a hybrid static/dynamic approach in [16] that can be incorporated into current MPI implementations of structured grid codes to improve the load balancing of the initial static decompositions. This work embraces the fundamental principles advocated in [16], and applies it in the context of dense matrix factorizations. Xue et al introduced an approach in [24] that improves the data locality when executing loop iterations in codes.…”

Section: Related Work (mentioning)

confidence: 99%

“…In order to be scalable for future high-performance clusters (i.e., exascale), the code running within a node of a cluster must be tuned such that it achieves not simply "high-performance", but also "performance consistency" [14], [16]. Such static tuning techniques provide few guarantees on performance consistency.…”

Section: Introduction (mentioning)

confidence: 99%

Hybrid Static/dynamic Scheduling for Already Optimized Dense Matrix Factorization

Donfack

Grigori

Gropp

et al. 2012

2012 IEEE 26th International Parallel and Distributed Processing Symposium

Self Cite


We present the use of a hybrid static/dynamic scheduling strategy of the task dependency graph for direct methods used in dense numerical linear algebra. This strategy provides a balance of data locality, load balance, and low dequeue overhead. We show that the usage of this scheduling in communication avoiding dense factorization leads to significant performance gains. On a 48-core AMD Opteron NUMA machine, our experiments show that we can achieve up to 64% improvement over a version of CALU that uses fully dynamic scheduling, and up to 30% improvement over the version of CALU that uses fully static scheduling. On a 16-core Intel Xeon machine, our hybrid static/dynamic scheduling approach is up to 8% faster than the version of CALU that uses a fully static scheduling or fully dynamic scheduling. Our algorithm leads to speedups over the corresponding routines for computing LU factorization in well-known libraries. On the 48-core AMD NUMA machine, our best implementation is up to 110% faster than MKL, while on the 16-core Intel Xeon machine, it is up to 82% faster than MKL. Our approach also shows significant speedups compared with PLASMA on both of these systems.
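The hybrid static/dynamic idea described in this abstract and in the citation statements above can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `hybrid_schedule`, the `static_fraction` parameter, and the use of Python threads with a single lock-protected queue are all assumptions made for illustration. Each worker first runs a fixed, contiguous block of tasks (the static part, which preserves locality), then pulls leftover tasks from a shared queue (the dynamic part, which absorbs imbalance such as noise-induced delays).

```python
import threading
from collections import deque


def hybrid_schedule(num_tasks, num_workers, static_fraction, work):
    """Run work(i) for every i in range(num_tasks).

    Hypothetical illustration: the first static_fraction of the tasks is
    pre-assigned in contiguous blocks; the rest is handed out on demand.
    Returns, per worker, the list of task ids that worker executed.
    """
    n_static = int(num_tasks * static_fraction)

    # Static part: split tasks [0, n_static) into contiguous per-worker blocks.
    base, extra = divmod(n_static, num_workers)
    static_blocks = []
    start = 0
    for w in range(num_workers):
        size = base + (1 if w < extra else 0)
        static_blocks.append(range(start, start + size))
        start += size

    # Dynamic part: tasks [n_static, num_tasks) sit in one shared queue.
    shared = deque(range(n_static, num_tasks))
    lock = threading.Lock()
    executed = [[] for _ in range(num_workers)]

    def worker(w):
        # Phase 1: run the pre-assigned block with no synchronization.
        for i in static_blocks[w]:
            work(i)
            executed[w].append(i)
        # Phase 2: dequeue leftover tasks until the shared queue is empty.
        while True:
            with lock:
                if not shared:
                    return
                i = shared.popleft()
            work(i)
            executed[w].append(i)

    threads = [threading.Thread(target=worker, args=(w,))
               for w in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return executed


if __name__ == "__main__":
    # Half of 20 tasks are static, half dynamic, across 4 workers.
    result = hybrid_schedule(20, 4, 0.5, lambda i: None)
    for w, tasks in enumerate(result):
        print(f"worker {w}: {tasks}")
```

Setting `static_fraction` to 1.0 gives a fully static schedule and 0.0 a fully dynamic one; intermediate values trade dequeue overhead against the ability to rebalance, which is the tuning knob the cited work discusses in general terms.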


Toward a Standard Interface for User-Defined Scheduling in OpenMP

Kale

Iwainsky

Klemm

et al. 2019

OpenMP: Conquering the Full Hardware Spectrum

Self Cite


Parallel loops are an important part of OpenMP programs. Efficient scheduling of parallel loops can improve performance of the programs. The current OpenMP specification only offers three options for loop scheduling, which are insufficient in certain instances. Given the large number of other possible scheduling strategies, standardizing each of them is infeasible. A more viable approach is to extend the OpenMP standard to allow a user to define loop scheduling strategies within her application. The approach will enable standard-compliant application-specific scheduling. This work analyzes the principal components required by user-defined scheduling and proposes two competing interfaces as candidates for the OpenMP standard. We conceptually compare the two proposed interfaces with respect to the three host languages of OpenMP, i.e., C, C++, and Fortran. These interfaces serve the OpenMP community as a basis for discussion and prototype implementation supporting user-defined scheduling in an OpenMP library.
