Performance and Optimization of Data Prefetching Strategies in Scalable Multiprocessors (1994)
DOI: 10.1006/jpdc.1994.1102

Cited by 13 publications (9 citation statements). References 0 publications.
“…If this additional bandwidth is not available, prefetching may even be counter-productive because it overloads the network. (This has been pointed out in earlier publications, e.g., see [20].) In the full-prefetching strategy, overloading of the network is increased because all threads initiate the transfers right after receiving the invocation message sent by the caller of the operation.…”
Section: Successive Over-relaxation
confidence: 89%
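As an illustration of why the full-prefetching strategy can overload the network, the C sketch below shows every thread bursting all of its transfers the moment the invocation message arrives. This is a minimal sketch under stated assumptions: remote_fetch is a hypothetical non-blocking transfer primitive (stubbed here so the code runs), and the block and thread counts are invented for illustration.

```c
#include <stdio.h>

#define NUM_BLOCKS  8   /* illustrative number of remote blocks per thread */
#define NUM_THREADS 4   /* illustrative number of threads */

/* Hypothetical non-blocking transfer primitive; stubbed out so the sketch
 * is self-contained. A real runtime would start a remote transfer here. */
static void remote_fetch(int thread_id, int block) {
    printf("thread %d fetches block %d\n", thread_id, block);
}

/* Full prefetching: a thread issues all of its transfers as soon as the
 * invocation message arrives, contributing its whole burst to the network. */
static void on_invocation_full(int thread_id) {
    for (int b = 0; b < NUM_BLOCKS; b++)
        remote_fetch(thread_id, b);
}

int main(void) {
    /* All threads "receive the invocation" at once; the simultaneous bursts
     * are what saturate the network when spare bandwidth is scarce. */
    for (int t = 0; t < NUM_THREADS; t++)
        on_invocation_full(t);
    return 0;
}
```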
“…The remote memory access latency could be modified to better reflect the latency variations of a specific interconnection network. Previous studies reported in [4] and [26] have shown that network contention reduces the overall gains from supporting multiple outstanding requests in the network. Hence we expect slight drops in performance of all memory consistency models once these negative effects are included.…”
Section: I
confidence: 99%
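For intuition, the tension between multiple outstanding requests and contention can be sketched as a simple cap on in-flight reads. In the C sketch below, remote_read_async and remote_wait_one are hypothetical primitives (stubbed so the code runs) and the cap of 4 is arbitrary; this illustrates back-pressure in general, not the simulators used in [4] or [26].

```c
#include <stdio.h>

#define MAX_OUTSTANDING 4   /* arbitrary illustrative cap on in-flight reads */

/* Hypothetical runtime primitives, stubbed so the sketch is runnable. */
static void remote_read_async(int addr) { printf("issue read %d\n", addr); }
static void remote_wait_one(void)       { printf("retire one read\n"); }

/* Issue n remote reads while never allowing more than MAX_OUTSTANDING to
 * be in flight; the wait is where contention throttles the processor. */
static void read_all(const int *addrs, int n) {
    int in_flight = 0;
    for (int i = 0; i < n; i++) {
        if (in_flight == MAX_OUTSTANDING) {
            remote_wait_one();      /* stall until one request completes */
            in_flight--;
        }
        remote_read_async(addrs[i]);
        in_flight++;
    }
    while (in_flight-- > 0)
        remote_wait_one();          /* drain the remaining requests */
}

int main(void) {
    int addrs[10] = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9};
    read_all(addrs, 10);
    return 0;
}
```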
“…Relaxed memory consistency allows reordering of memory events and buffering or pipelining of remote memory accesses [1], [13], [15], [16], [29], [8]. Data prefetching attempts to hide long read latency by issuing read requests well ahead of time, with the expectation that the data will be available in the cache when it is referenced [21], [26]. Multithreading attempts to hide the long latency by context switching between several active threads, thus allowing the processor to perform useful work while waiting for remote requests or synchronization faults to complete [2], [3], [4], [24], [25], [28]. Most of these studies are based on simulation results.…”
Section: Introduction
confidence: 99%
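As a concrete, single-node instance of the prefetching idea described above, a loop can issue a non-binding prefetch for the element it will need a fixed number of iterations later, overlapping the read latency with useful work. The sketch below uses GCC/Clang's __builtin_prefetch; the distance of 16 is an illustrative choice, not a tuned value.

```c
/* Software prefetching with a fixed prefetch distance: each iteration
 * prefetches the element needed DIST iterations ahead, so by the time
 * a[i] is referenced the earlier prefetch has (hopefully) completed. */
#define DIST 16   /* illustrative prefetch distance */

double sum_with_prefetch(const double *a, int n) {
    double s = 0.0;
    for (int i = 0; i < n; i++) {
        if (i + DIST < n)
            __builtin_prefetch(&a[i + DIST], /*rw=*/0, /*locality=*/1);
        s += a[i];   /* uses data requested DIST iterations earlier */
    }
    return s;
}
```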
“…Characterizations of the distance between the data movement initiation and the use of the data have also been based on program averages, like the cache miss ratio used by [1] to model coarse grain multithreading and the prefetch distance used by [10] to model the effectiveness of prefetching techniques. The write-run metric proposed by [5] is based on the average number of writes by a processor to a shared data item before an access by another processor.…”
Section: Introduction
confidence: 99%
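To make the write-run metric concrete, the C sketch below computes the average write-run length from a toy trace of accesses to a single shared item: consecutive writes by one processor accumulate into a run, and any access by another processor ends it. The trace format and values are invented for illustration and are not from [5].

```c
#include <stdio.h>

/* One access to a single shared data item. */
typedef struct { int proc; char op; /* 'R' or 'W' */ } Access;

/* Average write run: mean number of consecutive writes by one processor
 * to the shared item before any access by another processor. */
static double avg_write_run(const Access *t, int n) {
    int runs = 0, writes = 0, run_len = 0, owner = -1;
    for (int i = 0; i < n; i++) {
        if (owner != -1 && t[i].proc != owner) {
            runs++; writes += run_len;   /* another processor ends the run */
            run_len = 0; owner = -1;
        }
        if (t[i].op == 'W') {
            owner = t[i].proc;           /* start or extend the write run */
            run_len++;
        }
    }
    if (run_len > 0) { runs++; writes += run_len; }  /* close trailing run */
    return runs ? (double)writes / runs : 0.0;
}

int main(void) {
    /* Toy trace: runs of lengths 2, 1, and 1 -> average 4/3. */
    Access trace[] = { {0,'W'}, {0,'W'}, {1,'R'}, {0,'W'}, {1,'W'} };
    printf("average write run = %.2f\n", avg_write_run(trace, 5));
    return 0;
}
```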