Abstract: A problem in technology mapping is that the quality of the final implementation depends significantly on the initially provided circuit structure. This problem is critical, especially for mapping under tight and complicated constraints. In this paper, we propose a procedure that takes into account a large number of circuit structures during technology mapping. A set of circuit structures is compactly encoded in a single graph, and the procedure dynamically modifies the set during technology mapping by applying simple local transformations to the graph. State-of-the-art technology mapping algorithms are naturally extended so that the procedure finds an optimal tree implementation over all of the circuit structures examined. We show that the procedure effectively explores the entire solution space obtained by applying algebraic decomposition exhaustively, yet the run time is proportional only to the size of the graph, which is typically logarithmic in the number of circuit structures encoded. The procedure has been implemented and used for commercial design projects. We present experimental results on benchmark examples to demonstrate its effectiveness.
This 4th-generation Alpha microprocessor, running at 1.2GHz, delivers up to 44.8GB/s of chip pin bandwidth. It contains a 1.75MB, ECC-protected, 7-way set-associative second-level write-back cache that delivers 19.2GB/s of bandwidth; two memory controllers supporting 8 Rambus™ channels running at 800Mb/s; four 6.4GB/s inter-processor communication ports; and a separate IO port capable of 6.4GB/s operation. The 21.1×18.8mm² chip contains 152M transistors and dissipates 125W at 1.5V. It is packaged in a 1443-pin LGA package and air-cooled. It is fabricated in a 0.18µm bulk CMOS process with 7 levels of copper interconnect. The chip is partitioned into four clocking domains and uses digital delay-locked loops to provide low-skew, controlled-edge-rate, synchronous clocks across the chip. Figure 15.6.1 shows the major functional units. The CPU of the chip is leveraged from an earlier Alpha design [1,2].

The Level-2 Cache Controller Unit manages the L2 cache and coordinates with the memory controllers and the router to service memory fills and cache-coherence requests. It queues up to 16 Level-1 cache miss addresses, 16 L1/L2 victim addresses, 35 cache-coherence request and response addresses, and 4 write-IO addresses. The address queues are implemented in a single 73-entry Address Register File with multiple read, write, and CAM ports (Figure 15.6.2). The data can be read or written every cycle from the top or bottom of the register file to minimize routing distance. The entry status bits are stored in a separate array. The address queues and status arrays generate up to 25 requests that the arbiter dynamically schedules onto 6 unique resources every cycle.

Misses from local or remote processors arrive at the memory controller and are stored in two 32-entry Directory In Flight Tables (DIFT). Incoming transactions are compared against pending DIFT entries.
Depending on the result of the CAM lookup, the transaction is either serialized to avoid hazards, merged to deliver a response to a pending transaction, or used to create a new entry. The DIFT tracks the coherence state for 32 possible outstanding transactions. The DIFT issue logic picks 1 of these 32 arbitrating entries. A scoreboard tracks the availability of 3 physical resources, which are used by 5 different classes of transactions. The issue logic is pipelined over 2 cycles and accounts for entries selected in the previous cycle. The DIFT can issue to the 3 resources each cycle, one of which is the DRAM controller (Figure 15.6.3). The DIFT arbiter can pick an entry each cycle while avoiding deadlock and maintaining fairness.

The DRAM controller (Figure 15.6.4) has a conflict-detection pipeline that maximizes parallelism and minimizes DRAM bank conflicts. In the first cycle, it translates the request address into device, bank, row, and column fields optimized for the RDRAM™ configuration. In the second cycle, it indexes a table that tracks open pages and active banks in the RDRAMs and determines the appropriate action in the memory system. It then compares the address against a 2...
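The first pipeline stage described above is, at its core, a bit-field decode of the physical address. The sketch below illustrates the idea; the field widths and their ordering are purely illustrative assumptions, not the actual RDRAM configuration.

```python
# Hypothetical first-cycle address translation: slice a physical address
# into column, bank, device, and row fields, lowest-order field first.
# Widths below are made up for illustration only.
FIELD_WIDTHS = {"column": 7, "bank": 5, "device": 5, "row": 10}

def decode_address(addr: int) -> dict:
    """Split addr into named fields by successive masking and shifting."""
    fields = {}
    for name, width in FIELD_WIDTHS.items():
        fields[name] = addr & ((1 << width) - 1)  # extract the low bits
        addr >>= width                            # advance to the next field
    return fields
```

In a real controller the field boundaries would be chosen to spread consecutive cache lines across banks and devices, which is what lets the later pipeline stages minimize bank conflicts.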
The model will enable us to use a modified tree-mapping technique to efficiently produce continuously-sized netlists satisfying certain electrical noise and power constraints.

Abstract: We present a new delay model for use in logic synthesis. A traditional model treats the area of a library cell as constant and makes the cell's delay a linear function of load. Our model is based on a different, but equally fundamental, linearity in the equation relating area, delay, and load: namely, we may keep a cell's delay constant by making its area a linear function of load. This allows us to technology-map using a library with continuous device sizing; it satisfies certain electrical noise and power constraints, and in certain cases is computationally simpler than a traditional model. We give results to support these claims. A companion paper [14] uses the computational simplicity to explore a wide search space of algebraic factorings in a mapped network.

Our own application is continuously-sized, full-custom design. However, the delay model is also applicable to other methodologies, such as high-end standard cell, where there are many sizes of each cell. Essentially, it applies to any technology where cell sizing to obtain a desired delay is viable.

Constant-delay modeling has been used frequently in technology-independent algorithms. For example, Wang [15, pg. 167] proposed decomposing a network into bounded-fanin NAND gates, assigning a unit delay to each level of logic, and determining and restructuring critical regions with the resulting arrival times. Singh [12, pp. 13-19] measured the accuracy of various technology-independent delay models and concluded that the unit-delay model on bounded-fanin gates was the most accurate. His speedup [8] made technology-independent decomposition decisions by first breaking the network down into two-input NAND gates, and then modeling each NAND gate as a unit delay.
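The duality in the abstract can be made concrete with a toy RC-style gate model, in which delay = p + r·load/size and area is proportional to size. The coefficients and functions here are illustrative assumptions, not the paper's actual cell characterization.

```python
def delay_of(size: float, load: float, p: float = 1.0, r: float = 2.0) -> float:
    """Traditional view: area (size) is fixed; delay grows linearly with load."""
    return p + r * load / size

def size_for(target_delay: float, load: float, p: float = 1.0, r: float = 2.0) -> float:
    """Constant-delay view: the size (hence area) needed to hit target_delay
    grows linearly with load."""
    return r * load / (target_delay - p)
```

Doubling the load doubles the required size, so at a fixed delay the cell's area is a linear function of load, which is exactly the linearity the model exploits.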
Introduction. Most technology mapping algorithms for logic synthesis have been targeted at technologies with a limited number of cell sizes. A straightforward modeling technique then models each library element with a unique cell, whose area is fixed and whose delay varies with output loading. A class of technology-mapping algorithms called tree-mapping [1,2,3] is well suited to such a model. Given a tree-structured network and a fixed cell library, tree-mapping algorithms run in time linear in the number of circuit nodes. They are also linear in the number of library cells, which is of course not a problem for these reasonably small libraries.
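The linear-time behavior of tree-mapping comes from a single bottom-up dynamic-programming pass that scans the library once per node. The sketch below is a deliberately minimal, area-only illustration with a made-up library in which cells are matched purely by fanin count; real matchers cover multi-node patterns and optimize delay as well.

```python
# Hypothetical toy library, keyed by fanin count: lists of (cell, area).
LIBRARY = {
    0: [("input", 0.0)],
    1: [("INV", 1.0), ("BUF", 1.5)],
    2: [("NAND2", 2.0), ("NOR2", 2.2)],
}

def map_tree(tree):
    """tree = (node_id, children). Returns (cover, area): the chosen cell
    per node and the total mapped area. One library scan per node gives
    time linear in nodes x cells."""
    node, children = tree
    sub = [map_tree(c) for c in children]          # solve subtrees first
    area = sum(a for _, a in sub)
    cell, cell_area = min(LIBRARY[len(children)],  # cheapest matching cell
                          key=lambda ca: ca[1])
    cover = {node: cell}
    for child_cover, _ in sub:
        cover.update(child_cover)
    return cover, area + cell_area
```

Because each subtree's optimal cost is independent of how its parent is covered (in this toy setting), the bottom-up pass is globally optimal for the tree.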
Post-silicon clock tuning is a technique used as part of speed-debug efforts to increase the allowable clock frequency of a chip. These days, it is not uncommon for high-end microprocessors to have cores containing a few thousand clock-tuning elements (i.e., variable-delay buffers). Each such buffer can be assigned one of several possible discrete delay values as part of the post-silicon speed-debugging process. With the proper mix of assignments, many chips that initially could not meet targeted speed requirements can be made to run within specification. With thousands of tunable buffers available on chip, the number of possible combinations of delay-value assignments is quite large. In addition, process variation causes the same design, once fabricated into silicon, to have different critical paths across different chips. Thus a specific buffer-delay assignment that most improves clock frequency for some chips may not be optimal for all chips. In this paper, we propose a tool, AutoRex, that produces clock-tuning assignments automatically. AutoRex takes data from a volume experiment across multiple process corners and analyzes it using Satisfiability Modulo Theories (SMT) solvers to create a single "recipe" of delay-buffer assignments such that the clock frequency is improved as much as possible over the entire sample of chips. Our results show up to a 9% improvement in frequency using AutoRex.
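The optimization problem described above can be illustrated, at toy scale, as a min-max search over per-buffer settings: minimize the clock period of the slowest chip in the sample. Everything below is an assumption for illustration; the chip models are made up, and the exhaustive enumeration shown only works for tiny instances, which is why AutoRex uses SMT solvers instead.

```python
from itertools import product

def worst_period(assignment, chips):
    """Clock period of the slowest chip in the sample under this recipe.
    Each 'chip' is a function from an assignment tuple to its clock period."""
    return max(chip(assignment) for chip in chips)

def best_recipe(num_buffers, settings, chips):
    """Enumerate every per-buffer delay setting and keep the single recipe
    that minimizes the worst chip's period (i.e., maximizes frequency
    across the whole sample)."""
    return min(product(settings, repeat=num_buffers),
               key=lambda a: worst_period(a, chips))
```

The min-max structure is why one chip's best assignment need not be the sample's best: a setting that speeds up one chip's critical path can slow down a different critical path on another chip.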