Sun Microsystems, Palo Alto, CA

This 3rd-generation superscalar processor, implementing the 64b SPARC V9 architecture, improves on previous processors through enhancements to the on-chip memory system and through circuit designs that speed up critical paths beyond the process entitlement [1,2]. In the on-chip memory system, both bandwidth and latency are scaled. The keys to scaling memory latency are a sum-addressed memory data cache, which allows the average memory latency to scale by more than the clock ratio, and a prefetch data cache [3]. Memory bandwidth is improved by wave-pipelined SRAM designs for the on-chip caches and a write cache for store traffic [4].

The chip operates at 800MHz and dissipates <60W from a 1.5V supply. It contains 23M transistors (12M in RAM cells) on a 244mm² die. Figure 25.2.1 contrasts this 7-metal-layer aluminum, 0.15µm CMOS design with the designs of previous generations. To manage growing microprocessor complexity, the design uses more aggressive circuit techniques, interconnect delay optimization, crosstalk reduction, improved power and clock distribution schemes, and better thermal management.

For minimum power dissipation and simplified verification, the primary circuit style is static CMOS using synthesis and automatic place and route. Where synthesis is not enough and full-custom design is not appropriate, a hybrid approach is used: domino cells are manually placed, CAD tools shield all wires, route clocks, and insert power and ground, and a commercial router completes the signal routing. For the most critical paths, custom dynamic logic design is used. Delayed-reset logic is used in the SRAM structures to minimize power and to simplify clock distribution. Large caches use a self-timed latency control circuit for one-cycle throughput and two-cycle latency.
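The sum-addressed data cache mentioned above removes the address adder from the cache access path. The text does not describe the circuit, so the following is only a minimal bit-level sketch of the general sum-addressed decode idea, not the chip's implementation: each row k checks whether base + offset equals k without propagating a carry. The names sum_matches and decode_row and the 4-bit width in the usage note are hypothetical and chosen for illustration.

```python
def sum_matches(a: int, b: int, k: int, width: int) -> bool:
    """Return True if (a + b) mod 2**width == k, with no carry propagation.

    Illustrative model only. For every bit position, the carry-in that
    would be required to produce sum bit k_i is compared against the
    carry-out implied by the bit below it. All positions are checked
    independently, which is what lets hardware do this in parallel.
    """
    mask = (1 << width) - 1
    # Carry-in each bit would need so that a_i + b_i + cin_i yields k_i.
    required_cin = (a ^ b ^ k) & mask
    # Carry-out each bit produces, assuming its sum bit really is k_i.
    implied_cout = ((a & b) | ((a | b) & ~k)) & mask
    # Carry into bit i is the carry out of bit i-1; carry into bit 0 is 0.
    supplied_cin = (implied_cout << 1) & mask
    return required_cin == supplied_cin


def decode_row(base: int, offset: int, width: int) -> int:
    """Return the single row index whose equality test fires.

    A real decoder evaluates all rows at once; this loop is a software
    stand-in for that parallel comparison.
    """
    hits = [k for k in range(1 << width) if sum_matches(base, offset, k, width)]
    assert len(hits) == 1  # exactly one row matches any base/offset pair
    return hits[0]
```

As a usage example under these assumptions, decode_row(0b1011, 0b0110, 4) selects the row for (11 + 6) mod 16 without ever forming the sum in an adder.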
A predecode flip-flop circuit incorporates the predecode logic function, eliminating 2 logic levels and significantly speeding up the address-decoding critical path. Logic structures use traditional domino logic as well as delayed-clocking domino logic with overlapping, multiphase, non-blocking clocks. Critical signals are never gated by clocks, creating a pseudo-transparent evaluation phase that maximizes speed. Consecutive logic stages are clocked by delayed phases with enough overlap to guarantee safe signal transitions. A family of edge-triggered flip-flops includes dynamic flip-flops that produce monotonic outputs for domino logic [5]. Members of this family also embed a full logic level while maintaining a low input-to-output delay, allowing a pipeline with only 8 logic stages per clock cycle. For ease of verification, dynamic design is chiefly confined to fully-shielded full-custom structures.

To facilitate single-cycle transfers, the working register file (WRF), which handles regular read/write operations, and the architectural register file (ARF), which stores 8 windows, are interleaved into one physical unit, a WARF (Figure 25.2.2). The WARF performs read, write, and transfer simultaneously. The 32...