The implementation of a first-generation CELL processor that supports multiple operating systems, including Linux, consists of a 64b Power processor element (PPE) and its L2 cache, multiple synergistic processor elements (SPEs) [1], each with its own local memory (LS) [2], a high-bandwidth internal element interconnect bus (EIB), two configurable non-coherent I/O interfaces, a memory interface controller (MIC), and a pervasive unit that supports extensive test, monitoring, and debug functions. The high-level chip diagram is shown in Fig. 10.2.1. The key attributes include hardware content protection, virtualization, and real-time support combined with extensive single-precision floating-point capability. By extending the Power architecture with SPEs that have coherent DMA access to system storage and with multi-operating-system resource management, CELL supports concurrent real-time and conventional computing. With a dual-threaded PPE and 8 SPEs, this implementation is capable of handling 10 simultaneous threads and over 128 outstanding memory requests. Figure 10.2.7 shows the die micrograph with roughly 234M transistors from 17 physical entities, 580k repeaters, and 1.4M nets, implemented in 90nm SOI technology with 8 levels of copper interconnects and one local interconnect layer. At the center of the chip is the EIB, composed of four 128b data rings plus a 64b tag, operated at half the processor clock rate. The wires are arranged in groups of four, interleaved with GND and VDD shields twisted at the center to reduce coupling noise on the two unshielded wires. To ensure signal integrity, over 50% of global nets are engineered with 32k repeaters. The SoC uses 2965 C4s with four regions of different row-column pitches attached to a low-cost organic package. This structure supports 15 separate power domains on the chip, many of which overlap physically on the die.
The processor element design, power and clock grids, global routing, and chip assembly support a modular design in a building-block-like construction. The chip contains 3 distinct clock-distribution systems, each sourced by an independent PLL, to support processor, bus interface, and memory-interface requirements. The main high-frequency clock grid covers over 85% of the chip, delivering the clock signal to processors and miscellaneous circuits. Second and third clock grids, each operating at fractions of the main clock signal, are interleaved with the main clock-grid structure, creating multiple clock-frequency islands within the chip. All clock grids are constructed on the lowest-impedance final two layers of metal, and are supported by a matrix of over 850 individually tuned buffers. This enables control of the clock-arrival times and skews, especially on the main clock grid that supports regions of widely varying clock-load densities. High-frequency clock-signal distribution optimization and verification rely on wire simulation models that include frequency-sensitive inductance and resistance phenomena. As shown in Fig. 10.2.2, final worst-case clock skew ac...
We describe the challenges of migrating the Cell Broadband Engine™ (Cell BE) [1-2] design from a 65nm SOI [3] to a 45nm twin-well CMOS technology on SOI with low-κ dielectric (κ = 2.4) and 10 copper metal layers [4]. The technology offers dual gate-oxide thicknesses of 1.16nm and 2.5nm for 1.0V and 1.5V nominal power supply, respectively. Thicker oxide devices are used in analog circuits. To guarantee the proper operation of existing gaming software, the exact cycle-by-cycle machine behavior, including operating frequency, must be preserved. We set the focus of design migration to four goals: 1) automated design migration where possible, 2) 30% power reduction, 3) 30% area reduction, and 4) design for manufacturability (DFM) improvement. With the design rules across technologies being relatively compatible, we take advantage of automated migration for the bulk of Cell BE circuit blocks. Circuits are manually fine-tuned for timing, noise tolerance, and design robustness after the initial automatic migration. We take a different approach with memory and analog circuits. Analog circuits do not scale well due to the required area for decoupling capacitance. The I/O area, especially the area for C4 bumps, dictates chip dimensions since the same number of I/O signals is required and the C4 pitch does not scale from the previous technology. Since digital circuits occupy the bulk of chip area, it is crucial to migrate them effectively. The original digital circuits in 65nm consist of 3 types of components: parameterized cells, common leaf cells (flip-flop and local clock buffer), and custom cells. The migration of parameterized cells is done through software. A tool called Migration Assistant Shape Handler (MASH) [5] is applied to the common leaf cells first. MASH first shrinks the shapes according to the scale factor between technologies.
Second, MASH corrects as many design rule violations in 45nm as possible with minimum layout perturbations, and the remaining violations are repaired manually. The pin locations for these cells are fixed and scaled only in size. Metal blockage changes are minimized to reduce the effect on higher design levels. Hierarchical migration is performed in 2 phases. The first phase includes placing scaled leaf cells at scaled coordinates and scaling any remaining shapes. The second phase applies MASH to remove design rule violations. Using this design migration methodology, we shrink the chip size by 34% with respect to 65nm. Figure 4.3.1 shows dimensions of Cell BE and its major partitions in 3 technologies. We take advantage of the automated approach as much as possible by applying it to smaller memory array blocks and then tuning circuits manually. As the SRAM cell size (0.404µm²) shrinks in 45nm from 65nm (0.7µm²), we address the cell stability concern due to process variability [6] by using a separate array power supply (VCS). Lowering the main power supply (VDD) is critical for reducing the chip power consumption. However, we cannot lower VCS by the same amount as VDD due to SRAM...
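The shrink-then-repair flow described above can be illustrated with a minimal sketch. It is not the MASH tool itself: the `Rect` type, the `shrink` and `min_width_violations` helpers, the scale factor, and the per-layer minimum widths are all hypothetical and chosen only to show the idea of scaling shapes first and then flagging what still violates a rule in the target technology.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rect:
    """Axis-aligned layout shape on a named layer (hypothetical model)."""
    layer: str
    x0: float
    y0: float
    x1: float
    y1: float

    def scaled(self, s: float) -> "Rect":
        # Scale every coordinate about the origin by the linear factor s.
        return Rect(self.layer, self.x0 * s, self.y0 * s, self.x1 * s, self.y1 * s)

    @property
    def width(self) -> float:
        # Narrower of the two dimensions, used for a minimum-width check.
        return min(self.x1 - self.x0, self.y1 - self.y0)


def shrink(shapes, scale):
    """Step 1: shrink all shapes by the scale factor between technologies."""
    return [r.scaled(scale) for r in shapes]


def min_width_violations(shapes, min_width):
    """Step 2 input: shapes narrower than the (assumed) per-layer minimum width.

    A real flow would check many rule types and repair them with minimal
    layout perturbation; this only reports one kind of violation.
    """
    return [r for r in shapes if r.width < min_width[r.layer] - 1e-9]


if __name__ == "__main__":
    # Illustrative values only; not taken from the paper.
    cell = [Rect("M1", 0.0, 0.0, 0.10, 1.0), Rect("M1", 0.0, 0.0, 0.20, 1.0)]
    migrated = shrink(cell, 0.7)
    for r in min_width_violations(migrated, {"M1": 0.08}):
        print("needs repair:", r)
```

In this sketch the narrow wire falls below the assumed minimum width after scaling and is reported for repair, while the wider one passes unchanged, which mirrors the split between automatically migrated shapes and those needing correction.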
This paper reviews the design challenges that current and future processors must face with stringent power limits and high frequency targets, and the design methods required to address the continuing system integration trends. This paper then describes the implementation of a first-generation CELL processor and the design methods used to overcome the above challenges. A CELL processor consists of a 64-bit Power Architecture processor coupled with multiple synergistic processors, a flexible I/O interface, and a memory interface controller that supports multiple operating systems including Linux. This multi-core SoC, implemented in 90nm SOI technology, achieved a high clock rate by maximizing custom circuit design while maintaining reasonable complexity through design modularity and reuse.