This 64-bit RISC microprocessor performs a load/store instruction in one clock and achieves 40 MIPS and 20 MFLOPS peak performance at a 40 MHz clock. Two techniques are adopted to attain this performance: (1) two translation lookaside buffers (TLB) with parallel and hierarchical word-line transition detection circuits, and (2) a self-clocked register file using a dataflow clocking scheme. A floating-point unit performs single- and double-precision floating-point operations concurrently with an integer unit. This chip is fabricated using 0.8 µm double-metal CMOS technology. About 1M transistors are contained in the 14.85 × 15.13 mm die housed in a 238-pin PGA. The power dissipation is about 4 W (max.). A micrograph of the chip is shown in Figure 1.

Figure 2 shows a block diagram of the processor. It consists of an integer unit (IU), a floating-point unit (FPU), a memory management unit (MMU), a 2kB 2-way set-associative write-back physical data cache (DCACHE), a 6kB 3-way set-associative instruction cache (ICACHE), a bus control unit (BCU) and an error-correction circuit (ECC) for one-bit error correction and two-bit error detection. The MMU contains an instruction TLB (ITLB) for instruction address translation and a general-purpose TLB (GTLB) for instruction/data address translation. In order to increase the performance of the processor, the external data bus, the internal ICACHE and DCACHE data buses, the integer unit data buses, and the data paths in the FPU are all 64 or more bits wide.

Reducing the number of clock cycles needed to execute load/store instructions, whose frequency ranges from 15 to 25% in many programs, is very important for cutting the CPI. The physical caches are also used for cache coherency management and to avoid performance degradation of multiprocessor systems on context switching. Parallel translation of data and instruction addresses by the GTLB and ITLB, and a self-clocked register file, are adopted so that load/store instructions can be performed in one clock cycle using the same 4-stage pipeline (IF, L, E and S) as other ALU instructions.

Figure 3 shows details of the pipeline stages of the load instruction. In the IF-stage, the instruction address is translated by the ITLB and the ICACHE is read if the ICACHE hits. Data is read from a self-clocked register file and the data address is calculated in the L-stage. In the E-stage, the data address is translated by the GTLB and the physical DCACHE is read if the DCACHE hits. Data is stored into the register file in the S-stage. The access time of a physical cache tends to be longer than that of a logical cache because of the address translation time in the TLB. To complete address translation and cache read in one clock, the 136-entry GTLB uses a word-line transition detection scheme and operates in 11.5 ns (Figure 4). An instruction address is translated by the 8-entry ITLB concurrently with a data address translation by the GTLB. The ITLB operates in 7.5 ns with non-precharged static circuitry. If the ITLB misses, the GTLB is accessed with the instruction address in th...
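
The timing figures and the two-level TLB arrangement described above can be illustrated with a small behavioural sketch. It is not the chip's circuit or the authors' design. The clock rate, TLB latencies and entry counts are taken from the text. The 4 KiB page size, the FIFO replacement policy, and all names (`Tlb`, `translate_instruction`, `slack_after_translation_ns`) are assumptions made only for illustration. What happens after a GTLB access on an ITLB miss is cut off in the text, so the sketch only reports which TLB supplied the translation.

```python
from dataclasses import dataclass, field

CLOCK_MHZ = 40.0      # from the text
GTLB_NS = 11.5        # GTLB access time, from the text
ITLB_NS = 7.5         # ITLB access time, from the text
GTLB_ENTRIES = 136    # from the text
ITLB_ENTRIES = 8      # from the text
PAGE_BITS = 12        # ASSUMPTION: 4 KiB pages (not stated in the text)


def clock_period_ns(mhz: float = CLOCK_MHZ) -> float:
    """Clock period in nanoseconds for a clock rate given in MHz."""
    return 1000.0 / mhz


def slack_after_translation_ns(tlb_ns: float, mhz: float = CLOCK_MHZ) -> float:
    """Time left in one clock for the cache read after the TLB has answered."""
    return clock_period_ns(mhz) - tlb_ns


@dataclass
class Tlb:
    """Toy TLB: maps virtual page numbers to physical frame numbers."""
    capacity: int
    entries: dict = field(default_factory=dict)

    def insert(self, vpn: int, pfn: int) -> None:
        if vpn not in self.entries and len(self.entries) >= self.capacity:
            # ASSUMPTION: FIFO eviction; the text gives no replacement policy.
            self.entries.pop(next(iter(self.entries)))
        self.entries[vpn] = pfn

    def lookup(self, vpn: int):
        return self.entries.get(vpn)


def translate_instruction(vaddr: int, itlb: Tlb, gtlb: Tlb):
    """Try the ITLB first; on an ITLB miss, access the GTLB with the same address.

    Returns (physical_address, source) where source is "ITLB", "GTLB" or "miss".
    """
    vpn = vaddr >> PAGE_BITS
    offset = vaddr & ((1 << PAGE_BITS) - 1)
    pfn = itlb.lookup(vpn)
    source = "ITLB"
    if pfn is None:
        pfn = gtlb.lookup(vpn)
        source = "GTLB"
    if pfn is None:
        return None, "miss"
    return (pfn << PAGE_BITS) | offset, source


if __name__ == "__main__":
    print("clock period (ns):", clock_period_ns())
    print("slack after GTLB (ns):", slack_after_translation_ns(GTLB_NS))
    print("slack after ITLB (ns):", slack_after_translation_ns(ITLB_NS))

    itlb, gtlb = Tlb(ITLB_ENTRIES), Tlb(GTLB_ENTRIES)
    gtlb.insert(0x10, 0x80)
    print(translate_instruction(0x10ABC, itlb, gtlb))
```

At 40 MHz the clock period is 25 ns, so under this simple model an 11.5 ns GTLB access leaves 13.5 ns of the E-stage for the DCACHE read. The sketch ignores setup times, wiring delay and hit comparison, which the text does not quantify.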