This paper studies and compares the use of data prefetching and an alternative mechanism, data forwarding, for reducing memory latency due to interprocessor communication in cache coherent, shared memory multiprocessors. Two multiprocessor prefetching algorithms are presented and compared. A simple blocked vector prefetching algorithm, considerably less complex than existing software pipelined prefetching algorithms, is shown to be effective in reducing memory latency and increasing performance. A Forwarding Write operation is used to evaluate the effectiveness of forwarding. The use of data forwarding results in significant performance improvements over data prefetching for codes exhibiting less spatial locality. Algorithms for data prefetching and data forwarding are implemented in a parallelizing compiler. Evaluation of the proposed schemes and algorithms is accomplished via execution-driven simulation of large, optimized, parallel numerical application codes with loop-level and vector parallelism. More data, discussion, and experiment details can be found in [1].
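The abstract above does not spell out either prefetching algorithm, so the following is only a generic sketch of the contrast it draws, under stated assumptions: a blocked scheme is taken to issue one prefetch for a whole block of vector elements before computing on that block, while a software-pipelined scheme is taken to prefetch a fixed distance ahead inside the loop, with a prologue. The function names `blocked_schedule` and `pipelined_schedule` and the tuple encoding of operations are hypothetical and are not taken from the paper.

```python
def blocked_schedule(n, block):
    """Assumed blocked scheme: prefetch a block of elements, then compute on it."""
    ops = []
    for start in range(0, n, block):
        end = min(start + block, n)
        # One prefetch request covers the half-open element range [start, end).
        ops.append(("prefetch", start, end))
        for i in range(start, end):
            ops.append(("compute", i))
    return ops


def pipelined_schedule(n, distance):
    """Assumed software-pipelined scheme: prefetch `distance` iterations ahead."""
    ops = []
    # Prologue: fetch the first `distance` elements before the loop starts.
    for i in range(min(distance, n)):
        ops.append(("prefetch", i, i + 1))
    # Steady state: each iteration prefetches one element ahead, then computes.
    for i in range(n):
        if i + distance < n:
            ops.append(("prefetch", i + distance, i + distance + 1))
        ops.append(("compute", i))
    return ops
```

In this sketch the blocked variant needs no prologue and no per-iteration bounds check, which is one plausible reading of why the abstract calls it considerably less complex.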
Scalable shared-memory multiprocessors are often slowed down by long-latency memory accesses. One way to cope with this problem is to use data forwarding to overlap memory accesses with computation. With data forwarding, when a processor produces a datum, in addition to updating its cache, it sends a copy of the datum to the caches of the processors that the compiler identified as consumers of it. As a result, when the consumer processors access the datum, they find it in their caches. This paper addresses two main issues. First, it presents a framework for a compiler algorithm for forwarding. Second, using address traces, it evaluates the performance impact of different levels of support for forwarding. Our simulations of a 32-processor machine show that, on average, a slightly optimistic support for forwarding speeds up five applications by 50% for large caches and 30% for small caches. For large caches, most read sharing misses can be eliminated, while for small caches, forwarding rarely increases the number of conflict misses. Overall, support for forwarding in shared-memory multiprocessors promises to deliver good application speedups.
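The forwarding mechanism described in this abstract can be illustrated with a toy model. This is a minimal sketch and not the paper's simulator: it assumes a write-invalidate protocol, infinite per-processor caches (so conflict misses cannot occur), and a trace format invented for illustration. The names `count_read_misses` and `demo_trace` are hypothetical.

```python
def count_read_misses(num_procs, trace, forwarding):
    """Count read misses over a trace of memory operations.

    Assumed trace entries (hypothetical format):
      ("write", producer, addr, consumers)  # consumers: list of processor ids
      ("read", reader, addr)
    """
    caches = [set() for _ in range(num_procs)]  # infinite caches (assumption)
    misses = 0
    for op in trace:
        if op[0] == "write":
            _, producer, addr, consumers = op
            # Write-invalidate (assumption): drop every other cached copy.
            for p in range(num_procs):
                if p != producer:
                    caches[p].discard(addr)
            caches[producer].add(addr)
            if forwarding:
                # Forwarding: push the new value into each consumer's cache.
                for c in consumers:
                    caches[c].add(addr)
        else:
            _, reader, addr = op
            if addr not in caches[reader]:
                misses += 1
                caches[reader].add(addr)
    return misses


# Processor 0 produces four data items that processors 1 and 2 later read.
demo_trace = [("write", 0, addr, [1, 2]) for addr in range(4)]
demo_trace += [("read", reader, addr) for reader in (1, 2) for addr in range(4)]

print(count_read_misses(3, demo_trace, forwarding=False))  # 8
print(count_read_misses(3, demo_trace, forwarding=True))   # 0
```

With forwarding enabled, every consumer read in this trace hits in the local cache, which mirrors the claim that most read sharing misses can be eliminated when caches are large. The model says nothing about the small-cache case, where forwarded data could displace other lines.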
EPG-sim is a newly developed set of tools that performs execution-driven critical path simulation, trace generation, and simulation for serial, optimistically parallelized, and parallel application codes. These capabilities are integrated within a single framework through the use of intelligent source-level instrumentation. The ability to perform execution-driven simulations driven by optimistically parallelized codes, the ability to execute these simulations on parallel hosts, the use of source-level instrumentation, and the integration of the capabilities provided by EPG-sim are among the novel contributions of this work. EPG-sim has important uses in studying parallel architectures, parallelizing compilers, and parallel applications.