This paper describes a hybrid latch-flipflop (HLFF) timing methodology aimed at a substantial reduction in latch latency and clock load. A common principle is employed to derive consistent
1996 IEEE International Solid-State Circuits Conference, 0-7803-3136-2/96/$5.00 © IEEE
This sixth-generation X86 instruction-set compatible microprocessor implements a set of multimedia extensions (MMX). Instruction predecoding to identify instruction boundaries begins during filling of the 32kB two-way set-associative L1 instruction cache, after which the predecode bits are stored in the 20kB predecode cache, as shown in Figure 1. The processor decodes up to two X86 instructions per clock, most of which are decoded by hardware into one to four RISC-like operations, called RISC86 Ops, whereas the uncommon instructions are mapped into ROM-resident RISC sequences. The instruction scheduler buffers up to 24 RISC86 operations, using register renaming with a total of 48 registers. Up to six RISC86 instructions are issued out-of-order to seven parallel execution units, speculatively executed, and retired in order. The branch algorithm uses two-level branch prediction based on an 8192-entry branch history table, a 16-entry branch target cache, and a 16-entry return address stack.

The processor incorporates the extensions to the X86 instruction set called the multimedia extensions (MMX). The MMX unit supports instruction and data types that are targeted at increasing performance in communications and multimedia. A single instruction, multiple data (SIMD) technique is used to process multiple operands of 8, 16, or 32b in a 64b data path to perform the highly parallel and compute-intensive algorithms involved in multimedia applications. The MMX unit supports 57 instructions, which allow additions, subtractions, multiplies, multiply-accumulates, shifts (logical or arithmetic), and several other operations, most of which can be executed on any data type.

The instruction tag RAM contains 512 20b physical tags. The tag RAM is logically 2-way set associative, but is physically constructed with 8 sets of tag-TLB comparators and 8 sets of snoop comparators, with 8 tags being read each cycle.
This allows all possible synonyms to be checked in a single cycle, at the expense of layout complexity and area. The tag RAM performs a read in the first half cycle and a write in the second half cycle. Write data is available at the beginning of the first half of the cycle and can be bypassed to the read outputs with no read access delay penalty. The sense amp with integrated bypass is shown in Figure 2.

The numeric processor PLA contains 17 inputs, 800 minterms, and 104 outputs. The AND and OR planes and their respective sense amps are differential. A partial transistor (drain, no source) provides a matched capacitive environment for dummy bit-lines.

The RISC86 Op code ROM contains 4k x 169b of storage. Bit-lines are single-ended, but are sensed differentially with respect to a reference line. Four bit-lines share a common reference line, with 4:1 column decoding. Minimum pitch metal1 is used for bit-lines with no shielding. This is possible due to the use of resistive load elements for both bit and reference lines. The load elements are constructed of PMOS transistors biased in the linear region. The reference loads have half the resistanc...
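
The two-level branch prediction mentioned earlier can be illustrated with a generic textbook scheme. The sketch below is not the processor's actual predictor: the text gives only the 8192-entry table size, so the 9-bit global history, the XOR index hash, the 2-bit saturating counters, and the names `TwoLevelPredictor` and `run_branch` are all assumptions made for illustration.

```python
class TwoLevelPredictor:
    """Generic two-level predictor: global history selects a 2-bit counter.

    Only the 8192-entry table size comes from the text; history length,
    index hash, and counter width are illustrative assumptions.
    """

    def __init__(self, entries=8192, history_bits=9):
        self.entries = entries
        self.history_mask = (1 << history_bits) - 1
        self.history = 0                   # first level: recent outcomes
        self.counters = [1] * entries      # second level: weakly not-taken

    def _index(self, pc):
        # Assumed hash: XOR the branch address with the global history.
        return (pc ^ self.history) % self.entries

    def predict(self, pc):
        return self.counters[self._index(pc)] >= 2

    def update(self, pc, taken):
        i = self._index(pc)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)
        # Shift the newest outcome into the history register.
        self.history = ((self.history << 1) | int(taken)) & self.history_mask


def run_branch(predictor, pc, outcomes):
    """Predict then train on each outcome; return per-branch hit flags."""
    hits = []
    for taken in outcomes:
        hits.append(predictor.predict(pc) == taken)
        predictor.update(pc, taken)
    return hits
```

With this sketch, a strictly alternating taken/not-taken branch is predicted correctly once the history register has warmed up, which a single per-branch counter cannot do.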
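
The SIMD technique described for the MMX unit, several 8, 16, or 32b operands packed into one 64b word, can be modeled in a few lines. This is a minimal software sketch, not the hardware datapath: the function name `packed_add` is hypothetical, and wraparound (non-saturating) arithmetic is an assumption, since the text does not say which overflow behavior each instruction uses.

```python
def packed_add(a: int, b: int, lane_bits: int) -> int:
    """Add two 64-bit words lane by lane, wrapping within each lane.

    lane_bits is 8, 16, or 32, matching the operand widths in the text.
    Wraparound on overflow is an assumption made for this sketch.
    """
    if lane_bits not in (8, 16, 32):
        raise ValueError("lane_bits must be 8, 16, or 32")
    lane_mask = (1 << lane_bits) - 1
    result = 0
    for shift in range(0, 64, lane_bits):
        lane_sum = ((a >> shift) & lane_mask) + ((b >> shift) & lane_mask)
        # Masking discards the carry so it never leaks into the next lane.
        result |= (lane_sum & lane_mask) << shift
    return result
```

The same two 64b inputs give different results depending on the lane width, because carries are cut at every lane boundary; for example, `packed_add(0x00FF00FF00FF00FF, 0x0001000100010001, 16)` yields `0x0100010001000100`, while the 8b version wraps every byte to zero.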
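
The stated cache and tag RAM sizes imply some geometry that the text does not spell out. The sketch below works through that arithmetic under one plausible reading: the 4kB page size is an assumption (the usual X86 page size, not given in the text), and the function names are hypothetical. The text itself does not confirm that this is why 8 tags are read per cycle.

```python
def synonym_candidates(cache_bytes: int, ways: int, page_bytes: int) -> int:
    """Number of tag locations where one physical line could reside.

    Index bits above the page offset are untranslated virtual bits, so a
    physical line may alias into way_bytes / page_bytes sets in each way.
    """
    way_bytes = cache_bytes // ways
    aliases_per_way = max(1, way_bytes // page_bytes)
    return aliases_per_way * ways


def bytes_per_tag(cache_bytes: int, tags: int) -> int:
    """Cache bytes covered by each physical tag."""
    return cache_bytes // tags


def predecode_bits_per_byte(predecode_bytes: int, cache_bytes: int) -> int:
    """Predecode bits stored per instruction-cache byte."""
    return (predecode_bytes * 8) // cache_bytes


# Sizes from the text; PAGE_BYTES is an assumption.
CACHE_BYTES = 32 * 1024
PREDECODE_BYTES = 20 * 1024
PAGE_BYTES = 4 * 1024

print(synonym_candidates(CACHE_BYTES, 2, PAGE_BYTES))       # 8
print(bytes_per_tag(CACHE_BYTES, 512))                      # 64
print(predecode_bits_per_byte(PREDECODE_BYTES, CACHE_BYTES))  # 5
```

Under these assumptions, the 8 candidate locations match the 8 tags read and compared each cycle, each tag covers 64 bytes of the 32kB cache, and the 20kB predecode cache works out to 5 predecode bits per instruction byte.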