The IBM POWER A processor is the dominant reduced instruction set computing microprocessor in the world today, with a rich history of implementation and innovation over the last 20 years. In this paper, we describe the key features of the POWER7 A processor chip. On the chip is an eight-core processor, with each core capable of four-way simultaneous multithreaded operation. Fabricated in IBM's 45-nm silicon-on-insulator (SOI) technology with 11 levels of metal, the chip contains more than one billion transistors. The processor core and caches are significantly enhanced to boost the performance of both single-threaded response-time-oriented, as well as multithreaded, throughput-oriented applications. The memory subsystem contains three levels of on-chip cache, with SOI embedded dynamic random access memory (DRAM) devices used as the last level of cache. A new memory interface using buffered double-data-rate-three DRAM and improvements in reliability, availability, and serviceability are discussed.
The POWER8i processor is the latest RISC (Reduced Instruction Set Computer) microprocessor from IBM. It is fabricated using the company's 22-nm Silicon on Insulator (SOI) technology with 15 layers of metal, and it has been designed to significantly improve both single-thread performance and single-core throughput over its predecessor, the POWER7 A processor. The rate of increase in processor frequency enabled by new silicon technology advancements has decreased dramatically in recent generations, as compared to the historic trend. This has caused many processor designs in the industry to show very little improvement in either single-thread or single-core performance, and, instead, larger numbers of cores are primarily pursued in each generation. Going against this industry trend, the POWER8 processor relies on a much improved core and nest microarchitecture to achieve approximately one-and-a-half times the single-thread performance and twice the single-core throughput of the POWER7 processor in several commercial applications. Combined with a 50% increase in the number of cores (from 8 in the POWER7 processor to 12 in the POWER8 processor), the result is a processor that leads the industry in performance for enterprise workloads. This paper describes the core microarchitecture innovations made in the POWER8 processor that resulted in these significant performance benefits.
The Synergistic Processor Element (SPE) is the first implementation of a new processor architecture designed to accelerate media and streaming workloads. Area and power efficiency are important enablers for multi-core designs that take advantage of parallelism in applications [2]. The architecture reduces area and power by solving scheduling problems such as data fetch and branch prediction in software. SPE provides an isolated execution mode that restricts access to certain resources to validated programs.The focus on efficiency comes at the cost of multi-user operating system support. SPE load and store instructions are performed within a local address space, not in system address space. The local address space is untranslated, unguarded and non-coherent with respect to the system address space and is serviced by the local store (LS). Loads, stores and instruction fetch complete without exception, greatly simplifying the core design. The LS is a fully pipelined, single-ported, 256kb SRAM [3] that supports quadword (16B) or line (128B) access.The SPE is a SIMD processor programmable in high level languages such as C or C++ with intrinsics. Most instructions process 128b operands, divided into four 32b words. The 128b operands are stored in a 128-entry-unified-register file used for integer, floating point and conditional operations. The large register file facilitates deep unrolling to fill execution pipelines. Figure 7.4.1 shows how the SPE is organized as well as the key bandwidths (per cycle) between units.Instructions are fetched from the LS in 32 4B groups. Fetch groups are aligned to 64B boundaries to improve the effective instruction fetch bandwidth. 3.5 fetched lines are stored in the instruction line buffer (ILB) [1]. One-half line holds instructions while they are sequenced into the issue logic; as another line holds the single entry software managed branch target buffer (SMBTB) and two lines are used for inline prefetching. Efficient software manages branches in three ways: it replaces branches with bit-wise select instructions; it arranges for the common case to be inline; it inserts branch hint instructions to identify branches and load the probable targets into the SMBTB.The SPE can issue up to two instructions per cycle to seven execution units organized in two execution pipelines. Instructions are issued in program order. Instruction fetch sends double word address-aligned instruction pairs to the issue logic. Instruction pairs are issued if the first instruction (from an even address) is routed to an even pipe unit and the second instruction to an odd pipe unit. Loads and stores wait in the issue stage for an available LS cycle. Issue control and distribution require three cycles. Figure 7.4.5 details the eight execution units. Unit to pipeline assignment maximizes performance given the rigid issue rules. Simple fixed point [4], floating point [5] and load results are bypassed directly from the unit output to input operands reducing result latency. Other results are sent to the forward macro whe...
The IBM POWER6e microprocessor core includes two accelerators for increasing performance of specific workloads. The vector multimedia extension (VMX) provides a vector acceleration of graphic and scientific workloads. It provides single instructions that work on multiple data elements. The instructions separate a 128-bit vector into different components that are operated on concurrently. The decimal floating-point unit (DFU) provides acceleration of commercial workloads, more specifically, financial transactions. It provides a new number system that performs implicit rounding to decimal radix points, a feature essential to monetary transactions. The IBM POWERe processor instruction set is substantially expanded with the addition of these two accelerators. The VMX architecture contains 176 instructions, while the DFU architecture adds 54 instructions to the base architecture. The IEEE 754R Binary Floating-Point Arithmetic Standard defines decimal floating-point formats, and the POWER6 processor-on which a substantial amount of area has been devoted to increasing performance of both scientific and commercial workloads-is the first commercial hardware implementation of this format.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.
customersupport@researchsolutions.com
10624 S. Eastern Ave., Ste. A-614
Henderson, NV 89052, USA
This site is protected by reCAPTCHA and the Google Privacy Policy and Terms of Service apply.
Copyright © 2024 scite LLC. All rights reserved.
Made with 💙 for researchers
Part of the Research Solutions Family.