Code generation for embedded processors opens up the possibility for several performance optimization techniques that have been ignored by traditional compilers due to compilation time constraints. We present techniques that take into account the parameters of the data caches for organizing scalar and array variables declared in embedded code into memory, with the objective of improving data cache performance. We present techniques for clustering variables to minimize compulsory cache misses, and for solving the memory assignment problem to minimize conflict cache misses. Our experiments with benchmark code kernels from DSP and other domains on the CW4001 embedded processor from LSI Logic indicate significant improvements in data cache performance from applying our memory organization technique.

the performance improvement of applications running on general-purpose embedded processors. In a general-purpose embedded processor, the architecture more closely resembles traditional processors, with the following well-known exceptions: (1) we now frequently have only a single application running on the processor, and (2) we are permitted longer analysis and compilation times for the application. These features raise many interesting problems that are unique to the embedded processor environment, and that have not been addressed by traditional compilers (or have been addressed only partially), largely because of restrictions on the compilation time permitted.

Generation of efficient code for embedded processors has been the subject of recent investigation [Goosens et al. 1990; Paulin et al. 1995; Araujo et al. 1995]. Optimization techniques that improve the performance of application programs by exploiting the irregular architectures of some embedded DSP processors and other application-specific processors have been reported [Sudarsanam and Malik 1995; Goosens et al. 1990; Liem et al. 1994].
Research efforts have also focused on retargetable code generation, which attempts to generate code from the same behavioral specification for different target embedded processors, using a suitable processor model [Lanneer et al. 1995; Schenk 1995].

An important determinant of performance in embedded systems is the interaction between the processor and external memory. Embedded processors such as the CW4001 are equipped with on-chip instruction and data caches, which interface with larger off-chip memories. Since off-chip memory accesses usually stall CPU execution for significant durations (each access could take 10-20 processor cycles, depending on the relative processor and memory access speeds), it is important to design the interface between cache and main memory carefully [Patterson and Hennessy 1994]. Several architectural and compiler optimizations have been reported in the past that ensure spatial and temporal locality of programs so as to improve instruction and data cache performance.

Cache misses can be classified into several categories: (1) compulsory misses, caused when a memory word is accessed for the first time; (2) capacity mis...