AMD's 2-core "Bulldozer" module contains 213 million transistors in an 11-metal-layer 32nm HKMG SOI CMOS process and is designed to operate from 0.8 to 1.3V. This new micro-architecture [1] improves performance and frequency while reducing area and power compared to a previous AMD x86-64 CPU in the same process [2]. To achieve these goals, the design reduced the number of FO4 inverter delays/cycle by more than 20%, achieving higher frequencies in the same power envelope even with increased core counts. The 2-core CPU module area (including 2MB L2 cache) is 30.9mm² (Fig. 4.5.7). The module design contains 84 unique custom macros and 317,000 scannable flops. Module-level VSS power gating (CC6) reduces leakage power by 95% when both cores are idle [2]. Transistor Vts across the design are mostly regular (47%) and long-channel regular (46%).

The Bulldozer micro-architecture is cycle-based, using soft-edge flip-flops (SEF) to provide high-frequency performance, process variation tolerance, and low power consumption (Fig. 4.5.1). Performance and process tolerance are provided by a 2-clock design: early and late clocks (ECLK, LCLK) create a soft timing edge, allowing limited cycle stealing. Power is reduced in low-power SEFs by internally gated slave latch clocks. The majority of flops (78%) are low-power, using high-performance flops only on timing-critical paths.

In contrast to leveraged power-optimized CPU designs [2,4], Bulldozer's ground-up design requires co-development of power efficiency, timing, and functionality. Initially, micro-architectural power is optimized using a power-aware high-level performance model. Next, before schematic completion, the team tracks and analyzes RTL-based clock and flip-flop activity (a proxy for switching power) to meet clock gating goals. Finally, a new power model enables early mixed schematic/layout analysis of transistor-level power. This enables aggressive power optimizations while the implementation is still malleable.
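
As an illustration of how RTL-level clock and flip-flop activity can serve as a proxy for switching power, the following minimal Python sketch computes two per-flop figures from a cycle-by-cycle trace. The trace format, the `FlopTrace` type, and the "gating efficiency" metric are assumptions made for this example; the paper does not describe the actual tooling or the metrics used to set clock gating goals.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class FlopTrace:
    """Per-cycle activity of one flip-flop (hypothetical trace format)."""
    name: str
    clock_enabled: Sequence[bool]  # True if the gated clock reached the flop
    data_toggled: Sequence[bool]   # True if the flop output changed value


def clock_activity(trace: FlopTrace) -> float:
    """Fraction of cycles in which the flop was clocked."""
    return sum(trace.clock_enabled) / len(trace.clock_enabled)


def gating_efficiency(trace: FlopTrace) -> float:
    """Of the cycles with no output change, the fraction with the clock gated off."""
    idle = [not toggled for toggled in trace.data_toggled]
    idle_cycles = sum(idle)
    if idle_cycles == 0:
        return 1.0  # the flop changed every cycle, so no clock was wasted
    gated = sum(
        1 for enabled, is_idle in zip(trace.clock_enabled, idle)
        if is_idle and not enabled
    )
    return gated / idle_cycles


# Illustrative 4-cycle trace (made-up data, not from the paper).
trace = FlopTrace(
    name="example_flop",
    clock_enabled=[True, True, False, False],
    data_toggled=[True, False, False, False],
)
print(clock_activity(trace))     # 0.5
print(gating_efficiency(trace))  # two of the three idle cycles were gated
```

In a flow like the one described, figures of this kind would be aggregated over many flops and workloads and compared against a gating target before schematics are complete; the aggregation and the targets are not specified in the source.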
The result is a design with low power consumption for typical applications, making it well-suited to active power management and boost (Fig. 4.5.2).

The L1 caches are split, with the I-cache residing in the instruction unit and a D-cache located in each load/store unit of the 2 cores. The 2-way, 64KB I-cache consists of an 8×2 array of 4KB bank macros, with 2 more arrays for pre-decode bits. Load/store area in the 2 cores is at a premium, so the D-cache uses a 4-way 16KB array with performance features described later in the paper. Both L1 caches use an 8T storage cell. The change from a 6T cell in 45nm to 8T in 32nm was required to improve low-voltage margin and read timing and to reduce power. Use of the 8T cell also eliminated a difficult D-cache read-modify-write timing path. Reads use a 2-level pre-charged local/super bitline structure with delayed-onset keeper, single-rail, full-swing signals, and glitch latches.

Several D-cache performance features reduce conflicts and power. First, microbanking reduces read conflicts to the same rate as a previous 16-bank 64KB desi...