We present parallel versions of a representative N-body application that uses Greengard and Rokhlin's adaptive Fast Multipole Method (FMM). While parallel implementations of the uniform FMM are straightforward and have been developed on different architectures, the adaptive version complicates the task of obtaining effective parallel performance owing to the nonuniform and dynamically changing nature of the problem domains to which it is applied. We propose and evaluate two techniques for providing load balancing and data locality, both of which take advantage of key insights into the method and its typical applications. Using the better of these techniques, we demonstrate 45-fold speedups on galactic simulations on a 48-processor Stanford DASH machine, a state-of-the-art shared address space multiprocessor, even for relatively small problems. We also show good speedups on a 2-ring Kendall Square Research KSR-1. Finally, we summarize some key architectural implications of this important computational method.
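The abstract does not describe the two partitioning techniques themselves. As a purely illustrative sketch of the general idea of cost-based, locality-preserving partitioning for an adaptive tree method (all names, the data structures, and the cost model below are hypothetical, not taken from the paper), one might split the tree's leaves, in traversal order, into contiguous zones of roughly equal profiled work:

```python
# Illustrative sketch only (not the paper's implementation): partition the
# cells of an adaptive FMM tree among processors in depth-first traversal
# order, using per-cell work profiled in the previous timestep as the cost.
# Contiguous zones in tree order keep nearby bodies on the same processor,
# giving data locality as well as load balance.

from dataclasses import dataclass, field

@dataclass
class Cell:
    work: float                          # profiled work from the last timestep
    children: list = field(default_factory=list)

def leaf_costs(root):
    """Flatten the tree depth-first, pairing each leaf with its cost."""
    if not root.children:
        return [(root, root.work)]
    out = []
    for child in root.children:
        out.extend(leaf_costs(child))
    return out

def cost_partition(root, nprocs):
    """Assign contiguous zones of roughly equal total cost to each processor."""
    leaves = leaf_costs(root)
    target = sum(w for _, w in leaves) / nprocs
    zones, zone, acc = [], [], 0.0
    for cell, w in leaves:
        zone.append(cell)
        acc += w
        if acc >= target and len(zones) < nprocs - 1:
            zones.append(zone)
            zone, acc = [], 0.0
    zones.append(zone)
    return zones
```

A scheme of this kind relies on the observation that particle distributions in such simulations evolve slowly between timesteps, so the work profiled in one step is a good predictor of the next.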
While the desire to use commodity parts in the communication architecture of a DSM multiprocessor offers advantages in cost and design time, the impact on application performance is unclear. We study this performance impact through detailed simulation, analytical modeling, and experiments on a flexible DSM prototype, using a range of parallel applications. We adapt the logP model to characterize the communication architectures of DSM machines. The l (network latency) and o (controller occupancy) parameters are the keys to performance in these machines, with the g (node-to-network bandwidth) parameter becoming important only for the fastest controllers. We show that, of all the logP parameters, controller occupancy has the greatest impact on application performance. Of the two contributions of occupancy to performance degradation, the latency it adds and the contention it induces, it is the contention component that governs performance regardless of network latency, showing a quadratic dependence on o. As expected, techniques to reduce the impact of latency make controller occupancy a greater bottleneck. Surprisingly, the performance impact of occupancy is substantial even for highly tuned applications, and even in the absence of latency-hiding techniques. Scaling the problem size is often used as a technique to overcome limitations in communication latency and bandwidth. Through experiments on a DSM prototype, we show that there are important classes of applications for which the performance lost by using higher-occupancy controllers cannot be regained easily, if at all, by scaling the problem size.
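The quadratic dependence of contention on o that the abstract reports is the kind of behavior a simple queueing argument predicts. The toy sketch below is our own illustration, not the authors' methodology: it treats the node controller as an M/D/1 server with deterministic service time o and an assumed request arrival rate lam, under which the mean queueing delay grows quadratically in o at fixed utilization headroom:

```python
# Toy queueing illustration (not the paper's model): an M/D/1 server with
# service time o (controller occupancy) and Poisson arrival rate lam.
# The contention term lam*o^2 / (2*(1 - lam*o)) is quadratic in o,
# consistent with the abstract's observation that the contention induced
# by occupancy, rather than the latency it adds, dominates performance.

def mean_controller_delay(o, lam):
    """Mean time a request spends at the controller: service + queueing."""
    rho = lam * o                                  # controller utilization
    assert rho < 1.0, "controller saturated"
    waiting = lam * o * o / (2.0 * (1.0 - rho))    # contention (queueing) term
    return o + waiting

# Doubling occupancy more than doubles total delay once queueing matters:
for o in (0.1, 0.2, 0.4):
    print(f"o={o}: delay={mean_controller_delay(o, lam=1.0):.3f}")
```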
Many of the programming challenges encountered in small to moderate-scale hardware cache-coherent shared memory machines have been extensively studied. While work remains to be done, the basic techniques needed to efficiently program such machines have been well explored. Recently, a number of researchers have presented architectural techniques for scaling a cache-coherent shared address space to much larger processor counts. In this paper, we examine the extent to which applications can achieve reasonable performance on such large-scale, cache-coherent, distributed shared address space machines, by determining the problem sizes needed to achieve a reasonable level of efficiency. We also look at how much programming effort and optimization is needed to achieve high efficiency, beyond that needed at small processor counts. For each application, we discuss the main architectural bottlenecks that prevent smaller problem sizes or less optimized programs from achieving good efficiency. Our results show that while there are some applications that either do not scale or must be heavily optimized to do so, for most of the applications we studied it is not necessary to heavily modify the code or restructure algorithms to scale well up to several hundred processors, once the basic techniques for load balancing and data locality that are needed for small-scale systems as well are applied. Programs written with some care perform well without substantially compromising the ease-of-programming advantage of a shared address space, and the problem sizes required to achieve good performance are surprisingly small. It is important to be careful about how data structures and layouts interact with system granularities, but these optimizations are usually needed for moderate-scale machines as well.