D. Leibholz scite author profile

D. Leibholz

4 Publications

83 Citation Statements Received

0 Citation Statements Given

How they've been cited

171

How they cite others

Affiliations

Digital Wave (United States)

Publications


A 600 MHz superscalar RISC microprocessor with out-of-order execution

Gieseke

Allmon

Bailey

et al.

125


A six-issue, four-fetch, out-of-order execution, 600 MHz Alpha microprocessor achieves an estimated 40 SPECint95, 60 SPECfp95, and 1800 MB/s on McCalpin Stream. The 16.7 × 18.8 mm² die contains 15.2M transistors and dissipates an estimated 72 W. It is in 2.0 V, 6-metal, 0.35 µm CMOS with CMP planarization (Table 1) [1]. The chip is in a 587-pin ceramic IPGA with 198 pins for VDD/VSS that includes a CuW heat slug for low thermal resistance between die and detachable heat sink. An on-chip PLL performs frequency multiplication of a differential PECL reference and synchronizes I/O by phase-aligning a CPU clock to the reference. Figure 1 is a detailed floorplan of the chip. Figure 2 depicts a block/pipeline diagram of major sections and functions.

The instruction fetcher (Figure 3) reads four instructions per cycle plus a next-address pointer from a 64 kB, 2-way pseudo-set associative, virtual instruction cache. The next-address pointer predicts the address of the subsequent four instructions and indexes the cache in the next cycle. In parallel, a branch predictor resolves the prediction. It contains three tables: a PC-indexed prediction table, a path-indexed prediction table, and a path-indexed table that dynamically chooses one of the former two predictions, based on the success of previous predictions. Fetched instructions are dispatched to integer/memory (INT/MEM) and floating point (FP) pipelines, issued and executed out of order, and retired in order. During dispatch, register specifiers are renamed to eliminate false dependencies by two twelve-port register mappers that dynamically map the architectural registers into a pool of physical registers (80 integer and 72 FP). Resulting map state is retained in an array until the instruction retires. Pre-retire map state is used to generate a list of remaining free physical registers. Buffered map state is restored when the CPU is redirected following a branch mispredict or exception.

Mapped instructions enter a 20-entry INT/MEM or a 15-entry FP issue queue. The INT/MEM queue arbiter identifies the 4 oldest data-ready instructions. They issue to the integer execution unit (EBOX) and are removed from the INT/MEM queue. Similarly, the FP queue issues the 2 oldest data-ready instructions to the FP execution unit (FBOX) and removes them from the FP queue.

The EBOX (Figure 4) is divided into two clusters, CL0 and CL1; each cluster contains 2 independent execution pipelines surrounding an 80-entry register file. Coherency between the two register file copies is maintained by broadcasting results across intercluster buses. Each of the four pipelines executes and bypasses arithmetic and logical operations in one cycle. Bypassed results between clusters take an additional cycle. The upper pipelines handle branches and shifts; CL0 contains a pipelined multimedia engine (3-cycle latency) and CL1 contains a pipelined multiplier (7-cycle latency). The lower pipelines handle displacement address calculations for memory operations. The FBOX contains 2 independent execution pipelines surrounding a 72-en...
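The issue step described in the abstract (pick the oldest data-ready instructions, issue them, and remove them from the queue) can be illustrated with a small software sketch. This is not the chip's arbiter circuit; the `Entry` type, the `issue_oldest_ready` function, and the list-based queue are hypothetical names chosen for illustration, and the queue is assumed to be kept in oldest-first order.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    name: str    # hypothetical label for an instruction
    ready: bool  # True when all source operands are available


def issue_oldest_ready(queue, width):
    """Select up to `width` data-ready entries, oldest first.

    `queue` is assumed to be ordered oldest first. Returns a pair
    (issued, remaining); issued entries are removed from the queue.
    """
    issued = []
    remaining = []
    for entry in queue:
        if entry.ready and len(issued) < width:
            issued.append(entry)
        else:
            remaining.append(entry)
    return issued, remaining


# Example: an integer-style queue that issues up to 4 per cycle.
queue = [
    Entry("a", True),
    Entry("b", False),
    Entry("c", True),
    Entry("d", True),
    Entry("e", True),
    Entry("f", True),
]
issued, remaining = issue_oldest_ready(queue, 4)
issued_names = [e.name for e in issued]
remaining_names = [e.name for e in remaining]
```

With a width of 4 the sketch models the INT/MEM queue; a width of 2 would model the FP queue. Entry "b" stays behind because it is not data-ready, and "f" stays because four older ready entries were already chosen.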


The Alpha 21264: a 500 MHz out-of-order execution microprocessor

Leibholz

Razdan


Design tradeoffs in stall-control circuits for 600 MHz instruction queues

Fischer

Leibholz


A 600 MHz superscalar Alpha microprocessor contains separate integer and floating-point issue units [1, 2, 3]. The integer issue unit selects up to four data-ready instructions to issue out of a 20-entry queue. The floating-point issue unit, similarly, selects two instructions out of a separate 15-entry queue. Though their operation is similar, the two queues required different tradeoffs to meet design goals. This paper first describes functions common to both queues, then discusses specific tradeoffs made in the implementation of each queue's stall control circuits.

The queue performs three primary operations each cycle: compaction, enqueue, and issue. At the beginning of each cycle, valid queue entries (instructions) are compacted to the bottom of the queue, creating free entries at the top. Four new instructions enter the queue (are enqueued), if there are enough free entries, during the compaction process. The four oldest data-ready instructions are then issued, broadcast to function units, and marked for removal. If the queue is full at enqueue time, the input pipeline must stall (Figure 1). Generating the queue-full stall signal in each queue is a critical circuit path requiring different tradeoffs for each of the two implementations.

Each "queue-full" stall signal must be calculated early in the enqueue cycle to prevent the incoming group of instructions from overwriting queue entries that have not been freed. If the stall signal is not early, a latch stage and bypass path must be added to capture the incoming instructions for later enqueue, as shown in Figure 2. The additional latch and bypass path area and delay of this scheme are costly: 302 bits are enqueued per instruction each cycle. A single-cycle comparison of available queue entries and incoming instructions was not feasible in the 1.67 ns cycle.

The queue stall calculation is pipelined over two cycles and accounts for queue entries allocated and freed during the calculation itself. These "late" terms are conservative estimates of the true counts. The quality of the estimates is traded off against circuit complexity of generating accurate counts.

Queue compaction logic generates shift controls to propagate empty entries to the top of the queue. The initial free-entry count, generated by this logic, is used in the stall equation: Stall (cycle i) = [free-entry count(i-2) + issue count(i-1) - enq. count(i-1)]
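The bracketed sum in the stall equation can be worked through with a short sketch. The abstract as shown here gives only the sum, so the comparison against the size of the incoming group (four instructions) is an assumption made for illustration, as are the names `estimated_free`, `stall`, and `INCOMING_GROUP`. This models the arithmetic only, not the circuit.

```python
# Assumption: a full group of four instructions arrives each cycle.
INCOMING_GROUP = 4


def estimated_free(free_count_i2, issue_count_i1, enq_count_i1):
    """Bracketed term of the stall equation.

    free_count_i2  : free-entry count from cycle i-2
    issue_count_i1 : entries freed by issue in cycle i-1 (a "late" term)
    enq_count_i1   : entries allocated by enqueue in cycle i-1 (a "late" term)
    """
    return free_count_i2 + issue_count_i1 - enq_count_i1


def stall(free_count_i2, issue_count_i1, enq_count_i1, incoming=INCOMING_GROUP):
    """Assumed comparison: stall when the estimate cannot hold the group."""
    return estimated_free(free_count_i2, issue_count_i1, enq_count_i1) < incoming


# Six entries were free two cycles ago; last cycle four were enqueued.
no_issue = stall(6, 0, 4)   # estimate is 2 free entries
two_issued = stall(6, 2, 4)  # estimate is 4 free entries
```

Under these assumptions, undercounting the issue term only makes the estimate smaller, so an inaccurate "late" term can cause an unnecessary stall but never an overwrite, which is one way to read the abstract's description of the late terms as conservative.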


DECchip 21066: the Alpha AXP chip for cost-focused systems

McKinney

Leibholz

Rosenbluth

et al.


