This paper investigates the interaction between software pipelining and different software prefetching techniques for VLIW machines. It is shown that processor stalls due to memory dependences have a great impact on execution time. A novel heuristic is proposed and shown to outperform previous proposals.
Introduction

Software pipelining represents a family of loop scheduling techniques that try to exploit ILP by executing consecutive iterations of a loop in parallel. The most popular scheme is called modulo scheduling, and it consists of finding a fixed pattern of operations (of length II, or initiation interval) drawn from distinct iterations ([3]).

Several schemes have been proposed in the literature with the goal of minimizing the II and/or the register pressure, but none of them has evaluated the effect of memory. When software pipelining is applied to VLIW architectures, where instruction latencies and scheduling are fixed at compile time, execution time can be highly degraded by the stall time provoked by dependences with memory instructions. Even if a nonblocking cache is used, true dependences with previous memory operations at a near distance(1) can make the processor stall afterwards. The alternative of scheduling all loads using the cache-miss latency requires considerable ILP and increases register pressure.

Different techniques to improve memory behavior exist and are well known; software prefetching is one of them. The main idea of this method is to bring into the cache the data that will be used in the near future ([2]).

In this paper we investigate the interactions between software pipelining and software prefetching in a VLIW architecture. Some alternatives to perform software prefetching are described, and a novel heuristic is presented. An evaluation in terms of execution time is reported, as well as some conclusions.

(1) Almost all modulo scheduling schemes use a fixed cache-hit latency for all memory operations.
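The overlap that software pipelining achieves can be illustrated at the source level. The sketch below is not from the paper: `scale_pipelined` is a hypothetical, manually pipelined version of a simple loop in which the load for iteration i+1 is issued while iteration i is still computing, mimicking a kernel with prologue and epilogue stages; on a VLIW machine the scheduler would place these overlapped operations in the same long instruction word.

```c
#include <assert.h>

/* Naive loop: each iteration loads a[i], multiplies, then stores.
   The multiply must wait for the full load latency every iteration. */
void scale_naive(const int *a, int *b, int n) {
    for (int i = 0; i < n; i++)
        b[i] = a[i] * 2;
}

/* Software-pipelined sketch: consecutive iterations overlap.
   Each kernel iteration executes the load stage of iteration i+1
   together with the compute/store stage of iteration i. */
void scale_pipelined(const int *a, int *b, int n) {
    if (n == 0) return;
    int loaded = a[0];           /* prologue: fill the pipeline */
    for (int i = 0; i < n - 1; i++) {
        int next = a[i + 1];     /* load stage of iteration i+1 */
        b[i] = loaded * 2;       /* compute/store stage of iteration i */
        loaded = next;
    }
    b[n - 1] = loaded * 2;       /* epilogue: drain the pipeline */
}
```

Both versions compute the same result; the pipelined one simply exposes the load of the next iteration early enough that its latency can be hidden.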
Software prefetching schemes

Software prefetching is an effective technique to tolerate memory latency. When it is used with a nonblocking cache, this technique allows the processor to hide part or all of the memory latency by overlapping the fetching of data with computation.

Software prefetching can be performed through two alternative schemes: binding and nonbinding prefetching. The first alternative, also known as early scheduling of memory operations, moves memory instructions away from the instructions that depend on them. The second alternative introduces special instructions into the code, called prefetch instructions. These are nonfaulting instructions that perform a cache lookup but do not modify any register.

In the study presented in this paper we have evaluated two techniques of binding prefetching:

• Early scheduling always (ESA): all memory operations of the loop are scheduled using the cache-miss latency.

• Early scheduling according to locality (ESL): instructions that have some type of locality are scheduled using the cache-hit latency, and the remaining ones are scheduled using the cache-miss latency.
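The nonbinding scheme can be sketched with the GCC/Clang intrinsic `__builtin_prefetch`, which compiles to a nonfaulting prefetch instruction on targets that have one. This is an illustrative example, not the paper's method; the prefetch distance `D` is an assumed tuning parameter, chosen so the prefetched line arrives before the loop reaches it.

```c
#include <assert.h>
#include <stddef.h>

/* Nonbinding prefetching sketch: the prefetch brings b[i + D] toward
   the cache but writes no register and cannot fault, so correctness
   does not depend on it. D is a hypothetical prefetch distance. */
double sum_with_prefetch(const double *b, size_t n) {
    const size_t D = 16;   /* assumed distance: iterations of lead time */
    double s = 0.0;
    for (size_t i = 0; i < n; i++) {
        if (i + D < n)
            __builtin_prefetch(&b[i + D], 0 /* read */, 3 /* high locality */);
        s += b[i];
    }
    return s;
}
```

Binding prefetching, by contrast, needs no new instruction: the compiler simply schedules the real load earlier (with a longer assumed latency), which is what the ESA and ESL techniques above do.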