Exploring data transfer and storage issues is crucial to efficiently map data-intensive applications (e.g., multimedia) onto programmable processors. Code transformations are used to minimise the main memory bus load, and hence also power consumption, while improving system performance. However, this typically incurs a considerable arithmetic overhead in the addressing and local control. For instance, memory-optimising in-place and data-layout transformations add costly modulo and integer division operations to the initial addressing code. In this paper, we show how the cycle overhead can be almost completely removed. This is done according to a systematic methodology that combines an algebraic transformation exploration approach for the (non)linear arithmetic with an efficient transformation technique for reducing the piece-wise linear indexing to linear pointer arithmetic. The approach is illustrated on a real-life medical application, using a variety of programmable processor architectures. Total gains in cycle count ranging between a factor of 5 and a factor of 25 are obtained compared to conventional compilers.
FFTs are important modules in embedded telecom systems, many of which require low-power real-time implementations. This paper describes a technique for aggressively localizing data accesses in an (inverse) Fast Fourier Transformation at the source code level. The global I/O functionality is not modified, and neither is the bit-true arithmetic behavior. Typically 20 to 50% of the background memory accesses can be saved. A heavily parametrizable solution is proposed which leads to a family of power-optimized algorithm codes. Moreover, efficient coding details for specific instances are shown.

CONTEXT AND MOTIVATION

A hardware or even an embedded software realization has to be power efficient in order to reduce the size of the chip package (where it is embedded) or the battery (if used in a mobile application) [4]. The power cost in data-dominated applications is heavily dominated by the storage and transfer of complex data types. This has been demonstrated both for custom [18] and for programmable instruction-set processors [14]. The reason is that an off-chip data transfer consumes about 33 times more power than a typical 16-bit arithmetic operation such as an addition. In addition, experiments indicate that the number of primitive arithmetic operations is typically only a few times higher than the number of data transfer operations to large signals. Combined, this gives at least a factor of 10 difference in power consumption between the data transfers and the (equally unoptimized) arithmetic operations of an unoptimized description. Similar observations can be made for the FFT function in both software and especially hardware implementations. Hence the global data transfer and storage overhead should be reduced first in the system design trajectory. To perform such high-performance data-dominated functions, fast busses and large memories with high access rates from/to the data-path are needed.
Efficient implementation of the complex algorithms requires a global analysis of the critical sections, and code transformations to eliminate or at least alleviate the impact of these bottlenecks. Many systematic software-oriented memory management approaches exist in the literature, but they do not focus on the combination of performance and overall power.

Partly sponsored by the Esprit ESDLPD project 25518: DAB-LP. 0-7803-5650-0/99/$10.00 © 1999 IEEE
The ever increasing gap between processor and memory speeds has motivated the design of embedded systems with deeper cache hierarchies. To avoid excessive miss rates, instead of using bigger cache memories and more complex cache controllers, program transformations have been proposed to reduce the number of capacity and conflict misses. This is achieved, however, by complicating the memory index arithmetic code, which results in performance degradation when the code is executed on programmable processors with limited addressing capabilities. When these transformations are complemented by high-level address code transformations, the overhead introduced can be largely eliminated at compile time. In this paper, the clear benefits of the combined approach are illustrated on two real-life applications of industrial relevance, using popular programmable processor architectures, and showing important gains in energy (a factor of 2 less) with a relatively small penalty in execution time (8-25%) instead of the factors of overhead incurred without the address optimisation stage. The results of this paper lead to a systematic Pareto-optimal trade-off (supported by tools) between memory power and CPU cycles, which has up to now not been feasible for the targeted systems.