The implementation of a first-generation CELL processor that supports multiple operating systems, including Linux, consists of a 64b power processor element (PPE) and its L2 cache, multiple synergistic processor elements (SPEs) [1], each with its own local memory (LS) [2], a high-bandwidth internal element interconnect bus (EIB), two configurable non-coherent I/O interfaces, a memory interface controller (MIC), and a pervasive unit that supports extensive test, monitoring, and debug functions. The high-level chip diagram is shown in Fig. 10.2.1. The key attributes include hardware content protection, virtualization, and real-time support combined with extensive single-precision floating-point capability. By extending the Power architecture with SPEs that have coherent DMA access to system storage and with multi-operating-system resource management, CELL supports concurrent real-time and conventional computing. With a dual-threaded PPE and 8 SPEs, this implementation is capable of handling 10 simultaneous threads and over 128 outstanding memory requests.
Figure 10.2.7 shows the die micrograph with roughly 234M transistors from 17 physical entities, 580k repeaters, and 1.4M nets, implemented in 90nm SOI technology with 8 levels of copper interconnects and one local interconnect layer. At the center of the chip is the EIB, composed of four 128b data rings plus a 64b tag, operated at half the processor clock rate. The wires are arranged in groups of four, interleaved with GND and VDD shields twisted at the center to reduce coupling noise on the two unshielded wires. To ensure signal integrity, over 50% of global nets are engineered with 32k repeaters. The SoC uses 2965 C4s with four regions of different row-column pitches attached to a low-cost organic package. This structure supports 15 separate power domains on the chip, many of which overlap physically on the die.
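The thread count and the combined width of the EIB data rings quoted above follow from simple arithmetic. The sketch below only restates those figures; the constant and function names are illustrative, and the byte count is the raw width of the four data rings per EIB clock cycle, not a claim about sustained or peak bandwidth.

```python
# Figures taken from the text; all names here are illustrative only.
PPE_THREADS = 2            # dual-threaded PPE
SPE_COUNT = 8              # the text implies one thread per SPE
EIB_DATA_RINGS = 4         # four data rings
EIB_RING_WIDTH_BITS = 128  # each ring is 128b wide


def simultaneous_threads(ppe_threads=PPE_THREADS, spe_count=SPE_COUNT):
    """Hardware threads: the PPE threads plus one per SPE."""
    return ppe_threads + spe_count


def eib_raw_width_bytes(rings=EIB_DATA_RINGS, width_bits=EIB_RING_WIDTH_BITS):
    """Combined width of the data rings, in bytes per EIB clock cycle."""
    return rings * width_bits // 8


print(simultaneous_threads())  # 10
print(eib_raw_width_bytes())   # 64
```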
The processor element design, power and clock grids, global routing, and chip assembly support a modular design in a building-block-like construction. The chip contains 3 distinct clock-distribution systems, each sourced by an independent PLL, to support processor, bus interface, and memory-interface requirements. The main high-frequency clock grid covers over 85% of the chip, delivering the clock signal to processors and miscellaneous circuits. Second and third clock grids, each operating at fractions of the main clock signal, are interleaved with the main clock-grid structure, creating multiple clock frequency islands within the chip. All clock grids are constructed on the lowest-impedance final two layers of metal, and are supported by a matrix of over 850 individually tuned buffers. This enables control of the clock-arrival times and skews, especially on the main clock grid that supports regions of widely varying clock-load densities. High-frequency clock-signal distribution optimization and verification rely on wire simulation models that include frequency-sensitive inductance and resistance phenomena. As shown in Fig. 10.2.2, final worst-case clock skew ac...
The POWER4 chip, functioning in the laboratory at frequencies >1GHz, contains two independent processor cores, a shared L2, an L3 directory and all of the logic needed to form large SMPs. The chip, containing over 170M transistors, is fabricated using a 0.18µm CMOS SOI technology with 7-layer copper metallization. The physical design challenges for this chip are to guarantee functionality of all circuits, meet cycle time goals, check complex ground rules, verify that the transistors implement the VHDL properly, and meet test, power, and clock-distribution requirements on an aggressive schedule with a design team at multiple geographically separated sites. Each POWER4 core [1] is an out-of-order superscalar design containing an instruction fetch unit with its 64kB L1 instruction cache, an instruction decode unit, two fixed-point and two floating-point execution units, dual load-store execution units with a dual-ported 32kB L1 data cache, a branch execution unit, an execution unit to perform logical operations on the condition register, and a sequencing unit to manage instructions in flight. Instructions can be issued to each execution unit every cycle. Up to 8 data and 3 instruction cache misses are supported. In excess of 200 instructions can be in various stages of execution. The two cores share an 8-way set-associative unified L2 organized as 3 independent cache controllers. In aggregate, 12 outstanding L2 misses can be supported by the L2. Figure 15.2.1 shows an 8-way module, with 4 POWER4 chips, that is used as a system building block. A photo of the actual multi-chip module is shown in Figure 15.2.2. All logic necessary to communicate between POWER4 chips is contained on the chip. Multiple modules can be interconnected to form larger SMP systems. POWER4-to-POWER4 buses, on and off module, operate at half the processor speed. Buses to and from an off-chip L3 and memory operate at one-third the processor speed.
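The bus speeds above are fixed ratios of the processor clock, so they can be derived from any assumed core frequency. The following is a minimal sketch of that arithmetic; the function name, the dictionary keys, and the 1200 MHz example value are assumptions for illustration (the text states only that the chip runs at >1GHz).

```python
def bus_clocks_mhz(core_mhz):
    """Derive bus clocks from the processor clock using the stated ratios."""
    return {
        # POWER4-to-POWER4 buses, on and off module: half the processor speed
        "chip_to_chip": core_mhz / 2,
        # Buses to and from the off-chip L3 and memory: one-third the processor speed
        "l3_and_memory": core_mhz / 3,
    }


# Hypothetical core clock, chosen only to make the ratios easy to read.
print(bus_clocks_mhz(1200.0))
```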
Figure 15.2.3 lists the number of objects that are placed on the chip. The chip, with 2208 signal I/O C4s and over 5500 total C4s including power and ground, supports greater than 1Tb/s peak bandwidth. The chip physical design is built on a hierarchy of transistors, macros, units, microprocessor cores and chip. Three types of macros are employed: custom, SRAM and synthesized. During the high-level design phase, the macros, units, core and chip are all assigned contracts for timing, area, shape, wiring tracks and I/O. Timing and physical design of the chip are done concurrently on all levels of the hierarchy. All major buses are routed early in the design. Figure 15.2.4 shows the floorplanned buses. As the design progresses, contracts are modified to reflect the actual design. Significant design constraints include maintaining a slew rate of <300ps on all transitions, with a wire signal delay of approximately 100ps/mm. These constraints require more than 70k buffers/inverters to be inserted. In the final months of the design, turn-around-time from entering design changes to a chip level timing run is <1 day. ...
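To give a feel for the wire-delay figure, the sketch below estimates the delay of a route from its length and compares it with a clock period. It assumes delay scales linearly with length at the approximate 100ps/mm quoted above; the function names, the 5 mm route, and the 1 GHz clock are hypothetical values for illustration, not figures from the design.

```python
WIRE_DELAY_PS_PER_MM = 100.0  # approximate figure from the text


def wire_delay_ps(length_mm, ps_per_mm=WIRE_DELAY_PS_PER_MM):
    """Estimated signal delay of a wire, assuming delay is linear in length."""
    return length_mm * ps_per_mm


def fraction_of_cycle(length_mm, clock_ghz):
    """Share of one clock period consumed by the wire delay."""
    cycle_ps = 1000.0 / clock_ghz  # 1 GHz corresponds to a 1000 ps period
    return wire_delay_ps(length_mm) / cycle_ps


# Hypothetical 5 mm route at an assumed 1 GHz clock.
print(wire_delay_ps(5.0))           # 500.0
print(fraction_of_cycle(5.0, 1.0))  # 0.5
```

Under these assumptions a few millimetres of wire already consumes a large share of the cycle, which is consistent with the text's point that the constraints drive heavy buffer insertion.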