Scratch-pad memory (SPM), a small, fast, software-managed on-chip SRAM (Static Random Access Memory), is widely used in embedded systems. With the ever-widening performance gap between processors and main memory, it is important to reduce the serious off-chip memory access overheads caused by transferring data between SPM and off-chip memory. In this paper, we propose a novel compiler-assisted technique, ISOS (Iteration-access-pattern-based Space Overlapping SPM management), for dynamic SPM management with DMA (Direct Memory Access). In ISOS, we combine SPM and DMA for performance optimization by exploiting opportunities to overlap SPM space, so as to further utilize the limited SPM space and reduce the number of DMA operations. We implement our technique based on IMPACT and conduct experiments using a set of benchmarks from DSPstone and Mediabench on the cycle-accurate VLIW simulator of Trimaran. The experimental results show that our technique achieves run-time performance improvements over previous work: the average improvements are 13.15%, 19.05%, and 25.52% when the SPM sizes are 1 KB, 512 bytes, and 256 bytes, respectively.

... in embedded systems. However, it poses a huge challenge for the compiler to fully exploit SPM, since it is completely controlled by software. To manage SPM effectively, two kinds of compiler-managed methods have been proposed: static methods [6,8,10-17] and dynamic methods [1,18-30]. With static SPM management, the content of SPM is fixed and does not change while an application runs. With dynamic SPM management, the content of SPM is changed at run time according to the application's behavior. For dynamic SPM management, it is important to select an effective approach for transferring data between off-chip memory and SPM. This is because the latency of an off-chip memory access is about 10-100 times that of SPM [1,6,18,30], and many embedded applications in the image and video processing domains have significant data transfer requirements in addition to their computational requirements [9,31,32]. To reduce off-chip memory access overheads, dedicated cost-efficient hardware, DMA (Direct Memory Access) [33], is used to transfer data. The focus of this paper is on how to combine SPM and DMA in dynamic SPM management for optimizing loops, which are usually the most critical sections in embedded applications such as DSP and image processing. Our work is closely related to the work in [20,29,34-37]. In [20], Kandemir et al. proposed a dynamic SPM technique for loops that can determine memory layouts and the best loop access patterns, partition the available SPM space, and restructure the code for explicit data transfer. In [29], DMA is applied for data transfer between SPM and off-chip memory by applying graph coloring for SPM management. In [34,35], a two-level loop tiling technique with partitioning and pre-fetching is proposed for optimizi...
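To make the interaction between SPM and DMA concrete, the following is a minimal, hypothetical C sketch of dynamic SPM management for a loop using DMA double buffering: while the kernel computes on the tile currently held in one SPM buffer, the next tile is fetched from off-chip memory into the other buffer. The function names dma_start_read and dma_wait, the buffer layout, and the memcpy-based emulation are illustrative assumptions for a self-contained example, not the ISOS implementation described in the paper.

```c
/* Hypothetical sketch of dynamic SPM management with DMA double buffering.
 * dma_start_read()/dma_wait() are stand-ins for a platform DMA driver;
 * here they are emulated with memcpy so the example runs on a host PC. */
#include <stdio.h>
#include <string.h>

#define N    1024               /* elements in the off-chip array        */
#define TILE 64                 /* elements per SPM tile                 */

static int offchip[N];          /* models off-chip (slow) memory         */
static int spm[2][TILE];        /* models two SPM buffers (fast SRAM)    */

/* Stand-in for an asynchronous DMA read; a real driver would return
 * immediately and signal completion later. */
static void dma_start_read(int *dst, const int *src, int elems) {
    memcpy(dst, src, elems * sizeof(int));
}
static void dma_wait(void) { /* would block until the transfer finishes */ }

int main(void) {
    long long sum = 0;
    for (int i = 0; i < N; i++) offchip[i] = i;

    /* Prefetch the first tile into buffer 0. */
    dma_start_read(spm[0], &offchip[0], TILE);

    for (int t = 0; t < N / TILE; t++) {
        int cur = t & 1;            /* buffer holding the current tile    */
        dma_wait();                 /* make sure the current tile arrived */

        /* Start fetching the next tile into the other buffer while the
         * kernel computes on the current one (compute/transfer overlap). */
        if (t + 1 < N / TILE)
            dma_start_read(spm[cur ^ 1], &offchip[(t + 1) * TILE], TILE);

        for (int i = 0; i < TILE; i++)  /* loop kernel touches only SPM   */
            sum += spm[cur][i];
    }
    printf("sum = %lld\n", sum);        /* expected: N*(N-1)/2 = 523776   */
    return 0;
}
```

In this pattern the SPM cost is two tiles; techniques such as ISOS aim to shrink that footprint further by overlapping buffers whose live ranges do not conflict, which also reduces how many DMA transfers are issued.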
SUMMARY: Memory accesses introduce large timing overhead and power consumption because of the performance gap between processors and main memory. This paper describes and evaluates a technique, loop scheduling with memory access reduction (LSMAR), that replaces hidden redundant load operations with register operations in loop kernels and performs partial scheduling for the newly generated register operations subject to register constraints. By exploiting the data dependences of memory access operations, the LSMAR technique can effectively reduce the number of memory accesses in loop kernels, thereby improving timing performance. The technique has been implemented in the Trimaran compiler and evaluated using a set of benchmarks from DSPstone and MiBench on the cycle-accurate simulator of the Trimaran infrastructure. The experimental results show that, when the LSMAR technique is applied, the number of memory accesses is reduced by 18.47% on average over the benchmarks compared with when it is not applied. The measurements also indicate that the optimization leads to only a 1.41% average increase in code size. With such small code size expansion, the technique is more suitable for embedded systems than prior work.
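As an informal illustration of the kind of redundancy LSMAR targets (not the authors' scheduling algorithm itself, which operates on the compiler's loop kernel), consider a C loop in which consecutive iterations reload the same array element; carrying that value in a register removes one memory access per iteration. The function names and data below are assumptions made only for this sketch.

```c
/* Illustrative only: in the naive kernel the value loaded as a[i+1] is
 * loaded again as a[i] in the next iteration, i.e. a hidden redundant load. */
#include <stdio.h>

#define N 8

/* Naive kernel: two loads from a[] per iteration. */
static void smooth_naive(const int *a, int *b, int n) {
    for (int i = 0; i < n - 1; i++)
        b[i] = a[i] + a[i + 1];        /* a[i+1] will be reloaded as a[i] */
}

/* Register-carried version: one load per iteration. The value loaded as
 * a[i+1] is kept in `next` and reused as a[i] in the next iteration,
 * trading a memory access for a register operation, which is the general
 * idea behind memory access reduction in loop kernels. */
static void smooth_reg(const int *a, int *b, int n) {
    int cur = a[0];
    for (int i = 0; i < n - 1; i++) {
        int next = a[i + 1];           /* the only load in the kernel      */
        b[i] = cur + next;
        cur = next;                    /* register copy replaces a reload  */
    }
}

int main(void) {
    int a[N] = {1, 2, 3, 4, 5, 6, 7, 8}, b1[N] = {0}, b2[N] = {0};
    smooth_naive(a, b1, N);
    smooth_reg(a, b2, N);
    for (int i = 0; i < N - 1; i++)
        printf("%d %d\n", b1[i], b2[i]);   /* both columns should match   */
    return 0;
}
```

The extra register operations introduced by such a transformation are exactly what motivates LSMAR's partial rescheduling under register constraints, so that the saved loads are not offset by register pressure.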