The optimal logic depth per pipeline stage is 6 to 8 FO4 inverter delays

Hrishikesh, M.S.; Jouppi, Norman P.; Farkas, Keith I.; Burger, Doug; Keckler, Stephen W.; Shivakumar, Premkishore

doi:10.1109/isca.2002.1003558

Cited by 121 publications

(75 citation statements)

References 8 publications


“…A 16-bit addition, the main component of the RC's clock period, has been found to have a latency of 18.29 FO4 delays. This is in line with the cycle times of recent Intel processors, which have ranged from 12-20 FO4 delays [11]. However, future programmable processors are expected to include much less logic than current designs in each pipeline stage, leading to greater increases in clock rates than would be caused by technology scaling alone.…”

Section: Results (supporting)

confidence: 65%

“…Matching the predictions in the ITRS requires that the clock period of a programmable cluster decrease to 3.16 FO4 delays in 22nm processes. Results from [11] indicate that reducing clock periods to less than 6 to 8 FO4 delays per cycle hurts overall performance, leading us to believe that programmable processor clock rates will not scale at the rates predicted by the ITRS.…”

Section: Gate (mentioning)

confidence: 99%


A reconfigurable unit for a clustered programmable-reconfigurable processor

Kujoth, Wang, Gottlieb, et al. 2004

Proceedings of the 2004 ACM/SIGDA 12th International Symposium on Field Programmable Gate Arrays


In a clustered programmable-reconfigurable processor, multiple programmable processors and blocks of reconfigurable logic communicate through a register-based communication mechanism, which reduces the impact of wire delay on clock cycle time. In this paper, we present a circuit-level design for the reconfigurable clusters used on the Amalgam programmable-reconfigurable processor. We outline our interleaved reconfigurable array design, which provides high bandwidth to and from the register file without requiring large amounts of register control logic. We characterize the latency of operations in our array, and present results that show the impact that this latency has on overall system performance in a range of fabrication processes. Finally, we present a pipelining scheme that enables the array to operate at clock rates closer to those of programmable processors and allows for better scaling in future technologies.


“…The cache bank dimensions enable the calculation of wire lengths between successive routers. Based on delays for B-wires (Table 1) and a latch overhead of 2 FO4 [17], we compute the delay for a link (and round up to the next cycle for a 5 GHz clock). The (uncontended) latency per router is assumed to be three cycles.…”

Section: Extensions To Cacti (mentioning)

confidence: 99%

Interconnect design considerations for large NUCA caches

Muralimanohar, Naveen; Balasubramonian, Rajeev

2007

SIGARCH Comput. Archit. News


The ever-increasing sizes of on-chip caches and the growing domination of wire delay necessitate significant changes to cache hierarchy design methodologies. Many recent proposals advocate splitting the cache into a large number of banks and employing a network-on-chip (NoC) to allow fast access to nearby banks (referred to as Non-Uniform Cache Architectures, or NUCA). Most studies on NUCA organizations have assumed a generic NoC and focused on logical policies for cache block placement, movement, and search. Since wire/router delay and power are major limiting factors in modern processors, this work focuses on interconnect design and its influence on NUCA performance and power. We extend the widely-used CACTI cache modeling tool to take network design parameters into account. With these overheads appropriately accounted for, the optimal cache organization is typically very different from that assumed in prior NUCA studies. To alleviate the interconnect delay bottleneck, we propose novel cache access optimizations that introduce heterogeneity within the inter-bank network. The careful consideration of interconnect choices for a large cache results in a 51% performance improvement over a baseline generic NoC, and the introduction of heterogeneity within the network yields an additional 11-15% performance improvement.
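The link-delay step in the citation statement quoted for this paper (wire delay plus a 2 FO4 latch overhead, rounded up to whole cycles of a 5 GHz clock) can be sketched as follows. This is a minimal illustration, not the authors' CACTI extension: the function name, the choice to add the latch overhead once per link, and the example wire delay and FO4 values are all assumptions, since the B-wire delays of their Table 1 are not reproduced on this page.

```python
import math


def link_latency_cycles(wire_delay_ps, fo4_ps, clock_ghz=5.0, latch_overhead_fo4=2.0):
    """Round a link's delay up to a whole number of clock cycles.

    Assumption: the latch overhead (in FO4 delays) is added once to the
    raw wire delay before rounding. The quoted passage does not spell
    out the exact formula, so this is one plausible reading.
    """
    cycle_ps = 1000.0 / clock_ghz                 # 5 GHz -> 200 ps per cycle
    total_ps = wire_delay_ps + latch_overhead_fo4 * fo4_ps
    return math.ceil(total_ps / cycle_ps)


if __name__ == "__main__":
    # Hypothetical inputs: a 300 ps wire and a 25 ps FO4 delay.
    print(link_latency_cycles(wire_delay_ps=300.0, fo4_ps=25.0))
```

With the hypothetical inputs above, the link needs 350 ps in total, which exceeds one 200 ps cycle and is therefore charged two cycles.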


“…Our circuit parameters (Table 1) are chosen to represent a wide spectrum of CMOS technologies from recent-past (180nm) to near-future (70nm) technologies. The clock speeds are scaled proportionally to the gate delays and match the aggressive 8 fanout-of-four (FO4) delay for each technology [11]. Therefore, the cycle time stays the same relative to the gate delay and a single pipeline stage employs the same number of logic levels across technologies.…”

Section: Methods (mentioning)

confidence: 99%
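The scaling rule in the Methods statement above (a cycle time held at 8 FO4 delays while the gate delay shrinks with technology) can be made concrete with a small calculation. The sketch below assumes the common rule of thumb that one FO4 delay is about 360 ps per micron of drawn gate length, and it treats the quoted technology node as the drawn gate length. Both are simplifying assumptions for illustration, not values taken from the citing paper's Table 1, and the function names are hypothetical.

```python
FO4_PS_PER_MICRON = 360.0  # assumed rule of thumb, not from the citing paper


def fo4_delay_ps(drawn_gate_length_nm):
    """Estimate one FO4 inverter delay in picoseconds."""
    return FO4_PS_PER_MICRON * drawn_gate_length_nm / 1000.0


def clock_ghz(fo4_per_cycle, drawn_gate_length_nm):
    """Clock frequency in GHz when one cycle spans `fo4_per_cycle` FO4 delays."""
    period_ps = fo4_per_cycle * fo4_delay_ps(drawn_gate_length_nm)
    return 1000.0 / period_ps


if __name__ == "__main__":
    # The two endpoints mentioned in the quoted passage.
    for node_nm in (180, 70):
        print(node_nm, "nm:", round(clock_ghz(8, node_nm), 2), "GHz")
```

Under these assumptions an 8 FO4 cycle corresponds to roughly 1.9 GHz at 180nm and roughly 5.0 GHz at 70nm. The number of logic levels per stage is unchanged, so the higher clock rate comes only from faster gates.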