switching execution to the other thread. The overall execution time is $t_3$, where $(t_1 + t_2) \ge t_3 > \max\{t_1, t_2\}$. However, since the two threads compete for locations in the cache, one thread may evict the instructions cached by the other, and the real threaded execution turns out to be the one shown in Figure 1(d). As can be seen, the number of cache misses, and hence the thread switching frequency, is actually increased, as is the execution time ($t_4 > t_3$). Cache misses lead to memory accesses, and since memory accesses are power consuming, significant power is wasted. In addition, each thread switch incurs overhead, which further degrades the design.

In this paper, we investigate micro-architectural solutions that make the cache behave in harmony with the threaded execution in the pipeline. Since instruction cache misses, owing to frequent instruction fetches, have a considerably higher performance impact than data cache misses [1], we focus this study on the instruction cache. We target embarrassingly parallel applications, in which the same code is executed by a number of independent threads on different data sets. Such applications can be found in real-world computing problems such as encryption [2], scientific calculation [3], multimedia processing [4] and image processing [5]. These large computing problems demand multiprocessor systems built from small building-block processors like the one discussed in this paper, as illustrated in Figure 2. Such embedded processors are usually resource-constrained to reduce energy and area costs. We focus on a multithreaded processor with a single pipeline and a small cache that processes lightweight applications, and we present a thread synchronization approach that synchronizes thread execution
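To make the targeted workload class concrete, the following C sketch runs the same kernel in several independent POSIX threads, each over its own slice of data; the kernel body, thread count and data sizes are hypothetical placeholders chosen for illustration, not the benchmarks or the processor studied in this paper.

    #include <pthread.h>
    #include <stdio.h>

    #define NUM_THREADS 4
    #define CHUNK 256

    static int  data[NUM_THREADS][CHUNK]; /* each thread has its own data set  */
    static long sums[NUM_THREADS];        /* per-thread results, nothing shared */

    /* The same code runs in every thread; only the data slice differs. */
    static void *kernel(void *arg)
    {
        long id  = (long)arg;
        long acc = 0;
        for (int i = 0; i < CHUNK; i++)
            acc += (long)data[id][i] * data[id][i]; /* placeholder computation */
        sums[id] = acc;
        return NULL;
    }

    int main(void)
    {
        pthread_t tid[NUM_THREADS];
        for (long t = 0; t < NUM_THREADS; t++)
            pthread_create(&tid[t], NULL, kernel, (void *)t);
        for (long t = 0; t < NUM_THREADS; t++)
            pthread_join(tid[t], NULL);
        for (int t = 0; t < NUM_THREADS; t++)
            printf("thread %d: %ld\n", t, sums[t]);
        return 0;
    }

Because the threads execute identical instructions, they could in principle share the same cached code; the cache interference of Figure 1(d) arises precisely when such threads instead evict one another's instructions.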