Balancing processor loads and exploiting data locality in N-body simulations

Banicescu, Ioana; Hummel, Susan Flynn

doi:10.1145/224170.224306

Cited by 51 publications

(47 citation statements)

References 20 publications

“…Exemplar runtime systems implementing this approach are Zoltan [3], Chombo [15], and Charm++ [10]. Similar schemes have also been proposed and used in MPI applications [16,17]. This paper proposes concepts which build upon these existing frameworks in order to make decisions related to load balancing to get good performance.…”

Section: Previous Work (mentioning)

confidence: 99%

Automated Load Balancing Invocation Based on Application Characteristics

Menon

Jain

Zheng

et al. 2012

2012 IEEE International Conference on Cluster Computing

Abstract: Performance of applications executed on large parallel systems suffers due to load imbalance. Load balancing is required to scale such applications to large systems. However, performing load balancing incurs a cost which may not be known a priori. In addition, application characteristics may change due to the application's dynamic nature and the parallel system used for execution. As a result, deciding when to balance the load to obtain the best performance is challenging. Existing approaches put this burden on the users, who rely on educated guesses and extrapolation techniques to decide on a reasonable load balancing period, which may not be feasible or efficient. In this paper, we propose the Meta-Balancer framework, which relieves application programmers of deciding when to balance load. By continuously monitoring the application characteristics and using a set of guiding principles, Meta-Balancer invokes load balancing on its own without any prior application knowledge. We demonstrate that Meta-Balancer improves or matches the best performance that can be obtained by fine-tuning periodic load balancing. We also show that in some cases Meta-Balancer improves performance by 18% whereas periodic load balancing gives only a 1.5% benefit.

“…This layout is known in parallel computing as the Morton ordering and has been used for load balancing purposes [5,28,29,45,51,61]. It has also been applied for bandwidth reduction in information theory [6], for graphics applications [24,39], and for database applications [30].…”

Section: The Morton Layout Lmo (mentioning)

confidence: 99%

“…Such restructuring techniques have been studied for pointer-based data structures, such as heaps [35,37,38] and trees [13]; for profile-driven object placement [8]; for matrices with special structure (e.g., banded matrices in LAPACK [1], or sparse matrices [20]); and in parallel computing [5,28,29,45,51,61]. But when working with general dense matrices in a uniprocessor environment, most programmers are reluctant to alter the default row-major or column-major linearization of multidimensional arrays that high-level languages provide, even when such ordering degrades cache performance.…”

Section: Introduction (mentioning)

confidence: 99%

Nonlinear array layouts for hierarchical memory systems

Chatterjee

Jain

Lebeck

et al. 1999

Proceedings of the 13th International Conference on Supercomputing

126

142

Programming languages that provide multidimensional arrays and a flat linear model of memory must implement a mapping between these two domains to order array elements in memory. This layout function is fixed at language definition time and constitutes an invisible, non-programmable array attribute. In reality, modern memory systems are architecturally hierarchical rather than flat, with substantial differences in performance among different levels of the hierarchy. This mismatch between the model and the true architecture of memory systems can result in low locality of reference and poor performance. Some of this loss in performance can be recovered by re-ordering computations using transformations such as loop tiling. We explore nonlinear array layout functions as an additional means of improving locality of reference. For a benchmark suite composed of dense matrix kernels, we show by timing and simulation that two specific layouts (4D and Morton) have low implementation costs (2-5% of total running time) and high performance benefits (reducing execution time by factors of 1.1-2.5); that they have smooth performance curves, both across a wide range of problem sizes and over representative cache architectures; and that recursion-based control structures may be needed to fully exploit their potential.
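The contrast between a canonical layout and the Morton layout mentioned in the abstract above can be illustrated with a minimal sketch. It assumes the common bit-interleaving definition of Morton (Z-order) indexing, with column bits in the even positions and row bits in the odd positions; the cited paper may use a different convention. The function names `morton_index`, `row_major_index`, and `morton_order` are hypothetical and chosen only for this illustration.

```python
def morton_index(row, col, bits=16):
    """Interleave the low `bits` bits of row and col into one Z-order index.

    Assumption: column bits occupy even positions and row bits odd positions.
    """
    index = 0
    for b in range(bits):
        index |= ((col >> b) & 1) << (2 * b)      # column bit b -> position 2b
        index |= ((row >> b) & 1) << (2 * b + 1)  # row bit b -> position 2b + 1
    return index


def row_major_index(row, col, n_cols):
    """Canonical row-major linearization, shown for comparison."""
    return row * n_cols + col


def morton_order(n):
    """Cells of an n x n grid (n a power of two), sorted by Morton index."""
    cells = [(r, c) for r in range(n) for c in range(n)]
    return sorted(cells, key=lambda rc: morton_index(rc[0], rc[1]))


if __name__ == "__main__":
    # Each 2 x 2 block of the grid is visited before moving to the next block.
    for r, c in morton_order(4):
        print((r, c), morton_index(r, c), row_major_index(r, c, 4))
```

Under this convention, every aligned 2 x 2 block of the grid occupies four consecutive positions in memory, which is the locality property that distinguishes the layout from row-major ordering.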

“…This layout is known in parallel computing as the Morton ordering and has been used for load balancing purposes [7,25,26,33,36,40]. It has also been applied for bandwidth reduction in information theory [9], for graphics applications [20,30], and for database applications [27].…”

Section: Algorithm 6: Non-linear Array Layout (mentioning)

confidence: 99%

Cache-efficient matrix transposition

Chatterjee

Sen

Proceedings Sixth International Symposium on High-Performance Computer Architecture. HPCA-6 (Cat. No.PR00550)

We investigate the memory system performance of several algorithms for transposing an N × N matrix in-place, where N is large. Specifically, we investigate the relative contributions of the data cache, the translation lookaside buffer, register tiling, and the array layout function to the overall running time of the algorithms. We use various memory models to capture and analyze the effect of various facets of cache memory architecture that guide the choice of a particular algorithm, and attempt to experimentally validate the predictions of the model. Our major conclusions are as follows: limited associativity in the mapping from main memory addresses to cache sets can significantly degrade running time; the limited number of TLB entries can easily lead to thrashing; the fanciest optimal algorithms are not competitive on real machines even at fairly large problem sizes unless cache miss penalties are quite high; low-level performance tuning "hacks", such as register tiling and array alignment, can significantly distort the effects of improved algorithms; and hierarchical nonlinear layouts are inherently superior to the standard canonical layouts (such as row- or column-major) for this problem.

This work is supported in part by DARPA Grant DABT63-98-1-0001, NSF Grants CDA-97-2637 and CDA-95-12356, The University of North Carolina at Chapel Hill, Duke University, and an equipment donation through Intel Corporation's Technology for Education 2000 Program. The views and conclusions contained herein are those of the authors and should not be interpreted as representing the official policies or endorsements, either expressed or implied, of DARPA or the U.S. Government.
