Historically, advances in integrated circuit technology have driven improvements in processor microarchitecture and led to today's microprocessors, with their sophisticated pipelines operating at very high clock frequencies. However, the performance gains achievable by high-frequency microprocessors are now seriously limited by main-memory access latencies, because main-memory speeds have improved at a much slower pace than microprocessor speeds. Dealing with this performance disparity, commonly known as the memory wall,1 is crucial if future high-frequency microprocessors are to achieve their performance potential.

To overcome the memory wall, we propose kilo-instruction processors: superscalar processors that can maintain a thousand or more simultaneous in-flight instructions. Doing so means designing key hardware structures so that the processor can satisfy these high resource requirements without significantly decreasing processor efficiency or increasing energy consumption.
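A back-of-the-envelope, Little's-law-style estimate shows where the "kilo-instruction" figure comes from: to keep issuing at full width while a main-memory access is outstanding, a processor must track roughly issue width times miss latency instructions in flight. The sketch below assumes a 4-wide issue core and a 300-cycle miss latency; both numbers are illustrative, not measurements from this article.

```cpp
#include <cstdio>

int main() {
    // Illustrative parameters (assumptions, not figures from the article):
    const int issue_width  = 4;    // instructions issued per cycle
    const int miss_latency = 300;  // main-memory access latency, in cycles

    // Little's-law-style estimate: to sustain full issue width across an
    // outstanding main-memory access, the processor must track roughly
    // issue_width * miss_latency simultaneous in-flight instructions.
    std::printf("in-flight instructions needed ~ %d\n",
                issue_width * miss_latency);   // prints ~1200
    return 0;
}
```

At these rates the machine must hold on the order of a thousand in-flight instructions, which is the kilo-instruction target.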
Nature of the memory wall

One of the first approaches to the memory wall problem was the development of cache memory hierarchies. Cache memories exploit program locality and can dramatically reduce the number of long-latency accesses to main memory. The first level, or L1 cache, is built into the processor core and typically takes one to three processor clock cycles to access. On a miss in the L1 cache, the on-chip L2 cache takes on the order of 10 processor cycles. Accessing main memory, on the other hand, takes at least an order of magnitude longer, and in the future it will take two orders of magnitude longer, that is, several hundred clock cycles. (In general, the cache hierarchy can have more than two levels, but to simplify our discussion here, we assume two levels, with the understanding that the same principles apply to systems with deeper cache hierarchies.)

Modern superscalar processors employ out-of-order execution as a way of smoothing out disruptions caused by data cache misses (see the "Hiding latency in superscalar processors" sidebar). If a load instruction experiences a data cache miss, the instructions that depend on the miss data must wait in the issue queue(s). Meanwhile, independent instructions are free to execute; they issue from the issue queue(s) and essentially "pass" the blocked load instruction and its dependent instructions. For an L1 cache miss, these out-of-order instructions can often completely hide the L2 access latency, so the miss causes little or no performance loss; the sketch below illustrates this, and shows why the same trick falls short for a full L2 miss.
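A minimal sketch of this bypassing effect, using a simple dependence-driven model (the instruction names, latencies, and four-instruction program are our illustration, not from the article): each instruction becomes ready as soon as its producer finishes, so independent work completes under a load's shadow while dependent work waits.

```cpp
#include <cstdio>
#include <vector>

// Toy dependence model (an illustration, not the article's simulator).
// Each instruction records an execution latency and the index of the
// instruction producing its input (-1 if independent). An out-of-order
// core lets an instruction issue as soon as its producer finishes, so
// independent work "passes" a blocked load; dependent work must wait.
struct Instr {
    const char* name;
    int latency;  // execution latency in cycles
    int dep;      // producer index, or -1 if independent
};

static void run(const char* title, const std::vector<Instr>& prog) {
    std::printf("%s\n", title);
    std::vector<int> ready(prog.size());
    for (size_t i = 0; i < prog.size(); ++i) {
        // Start when the producer's result is available (cycle 0 if none).
        int start = (prog[i].dep >= 0) ? ready[prog[i].dep] : 0;
        ready[i] = start + prog[i].latency;
        std::printf("  %-26s ready at cycle %3d\n", prog[i].name, ready[i]);
    }
}

int main() {
    // L1 miss that hits in L2 (~10 cycles): independent work covers it.
    run("L1 miss, L2 hit:", {
        {"load r1 (10-cycle L2 hit)", 10, -1},
        {"add  r2 = r1 + 1",           1,  0},  // dependent on the load
        {"mul  r3 (independent)",      3, -1},
        {"sub  r4 (independent)",      1, -1},
    });
    // L2 miss (~300 cycles): the same independent work barely dents it.
    run("L2 miss:", {
        {"load r1 (300-cycle miss)", 300, -1},
        {"add  r2 = r1 + 1",           1,  0},
        {"mul  r3 (independent)",      3, -1},
        {"sub  r4 (independent)",      1, -1},
    });
    return 0;
}
```

In the first run, the independent mul and sub finish well inside the load's 10-cycle L2-hit window; in the second, the same few instructions leave almost all of the 300-cycle miss exposed.

This approach is much less effective for the long L2 cache misses, however. For example, along the top of Figure 1 is a sequence of instructions in program order. Following a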