In superscalar processors, capable of issuing and executing multiple instructions per cycle, fetch performance represents an upper bound to the overall processor performance. Unless there is some form of instruction re-use mechanism, you cannot execute instructions faster than you can fetch them.Instruction Level Parallelism, represented by wide issue out of order superscalar processors, was the trending topic during the end of the 90's and early 2000's. It is indeed the most promising way to continue improving processor performance in a way that does not impact application development, unlike current multicore architectures which require parallelizing the applications (a process that is still far from being automated in the general case). Given that the typical basic block size in SPEC integer benchmarks is 4-6 instructions, fetch performance was limited to those same 4-6 instructions per cycle, making 8-wide and 16-wide superscalar processors impractical. It became imperative to find mechanisms to fetch more than 8 instructions per cycle, and that meant fetching more than one basic block per cycle.
The Trace Cache [2] [3][4] quickly established itself as the state of the art in high performance instruction fetch. The trace cache relies on a trace building mechanism that dynamically reorders the control flow of the program, and stores the dynamic instruction sequences in sequential storage, increasing fetch width.However, it is a complex hardware structure that adds not only another cache memory to the on-chip storage hierarchy, but also requires a branch predictor capable of issuing multiple predictions per cycle to index the contents of the trace cache, and distinguish between multiple dynamic sequences of instructions.The Software Trace Cache is a compiler transformation, or a postcompilation binary optimization, that extends the seminar work of Pettis and Hansen PLDI'90 to perform the reordering of the dynamic instruction stream into sequential memory locations using profile information from previous executions. The major advantage compared to the trace cache, is that it is a software optimization, and does not require additional hardware to capture the dynamic instruction stream, nor additional memories to store them. Instructions are still captured in the regular instruction cache. The major disadvantage is that it can only capture one of the dynamic instruction sequences to store it sequentially. Any other control flow will result in taken branches, and interruptions of the fetch stream.Our results show that fetch width using STC was competitive with that obtained with the hardware TC, and was applicable to a wide range of superscalar architectures, because it does not require hardware changes to enable fetching from multiple basic blocks in a single cycle, even if only one branch prediction can be issued. Any front-end architecture built on a BTB that ignores branches as long as they are not taken, will automatically treat such branches as NOP instructions in terms of fetch [5].This was only the...