The next-generation System z design introduces a new microprocessor chip (CP) and a system controller chip (SC) aimed at providing a substantial boost to maximum system capacity and performance compared to the previous zEC12 design in 32nm [1,2]. As shown in the die photo, the CP chip includes 8 high-frequency processor cores, 64 MB of eDRAM L3 cache, interface IOs ("XBUS") to connect to two other processor chips and the L4 cache chip, along with memory interfaces, 2 PCIe Gen3 interfaces, and an I/O bus controller (GX). The design is implemented on a 678 mm² die with 4.0 billion transistors and 17 levels of metal interconnect in IBM's high-performance 22nm high-κ CMOS SOI technology [3]. The SC chip is also a 678 mm² die, with 7.1 billion transistors, running at half the clock frequency of the CP chip, in the same 22nm technology, but with 15 levels of metal. It provides 480 MB of eDRAM L4 cache, an increase of more than 2× from zEC12 [1,2], and contains an 18 MB eDRAM L4 directory, along with multi-processor cache control/coherency logic to manage inter-processor and system-level communications. Both the CP and SC chips incorporate significant logical, physical, and electrical design innovations.

Systems are built from configurable nodes of tightly-coupled CP and SC chips, each packaged on single-chip modules (Fig. 4.1.1). This structure provides improved flexibility and modularity compared to the multi-chip modules used previously. All high-speed node-to-node and drawer-to-drawer communication is through the SC chip using micro-controllers to manage the flow. Each SC chip contains over 440 of these micro-controllers along with a series of wide multiplexers to manage the traffic.
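As a rough sanity check on the die figures quoted above, the following minimal sketch derives average transistor density for the CP and SC chips and the size of the L4 directory relative to the L4 data array. The input numbers are taken from the text; the derived quantities and all names in the code are illustrative assumptions and are not figures reported by the authors.

```python
# Figures quoted in the text. Derived values below are illustrative
# arithmetic only, not results reported in the paper.
DIE_AREA_MM2 = 678.0       # both CP and SC dies
CP_TRANSISTORS = 4.0e9
SC_TRANSISTORS = 7.1e9
L4_CACHE_MB = 480.0
L4_DIRECTORY_MB = 18.0


def density_millions_per_mm2(transistors: float, area_mm2: float) -> float:
    """Average transistor density in millions of transistors per mm^2."""
    return transistors / area_mm2 / 1e6


cp_density = density_millions_per_mm2(CP_TRANSISTORS, DIE_AREA_MM2)
sc_density = density_millions_per_mm2(SC_TRANSISTORS, DIE_AREA_MM2)
directory_overhead = L4_DIRECTORY_MB / L4_CACHE_MB

print(f"CP density: {cp_density:.1f} M transistors/mm^2")
print(f"SC density: {sc_density:.1f} M transistors/mm^2")
print(f"L4 directory / L4 data: {directory_overhead:.2%}")
```

Under these inputs the averages come out to about 5.9 and 10.5 million transistors per mm² for the CP and SC chips respectively, with the directory amounting to 3.75% of the L4 data capacity. These are whole-die averages and say nothing about local density in logic versus eDRAM regions.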
Both the CP and SC chips support high levels of I/O bandwidth, with about 5Tb/s total bandwidth for each CP or SC chip, running at speeds of up to 5Gb/s (single-ended) and 9.6Gb/s (differential).

The CP chip adopted a unique floorplan configuration, driven by the width of the cores, which were too wide to fit four across on the die. This floorplan created significant logical and physical complexities in the L3 design, but careful engineering prevented these issues from having any meaningful impact on latency or bandwidth of the L3. The entire L3 and all 8 cores are covered with a single large "mega-mesh" clock domain, maximizing on-chip bus bandwidth. The unified mega-mesh design enables double-pumping of many on-chip buses for wider effective bandwidth, and eliminates any mesh-to-mesh timing margins in critical core-to-L3 timing paths.

The CP processor core design, shown in Fig. 4.1.2, improves upon the zEC12 processor [4] with two vector execution units, significantly higher instructions-per-cycle throughput, and a new SMT2 micro-architecture supporting simultaneous execution of two threads. The microprocessor core features a wide superscalar, out-of-order pipeline that can sustain an instruction fetch, decode, dispatch and completion rate of six CISC instructions per cycle. The instruction execution path is predicted by multi-level bra...