Exploiting fine-grain thread level parallelism on the MIT multi-ALU processor

Keckler, Stephen W.; Dally, William J.; Maskit, D.; Carter, N.P.; Chang, Andrew; Lee, W.S.

doi:10.1109/isca.1998.694790

Cited by 44 publications

(30 citation statements)

References 15 publications

“…The M-Machine employed an on-chip cluster switch to connect the register bypass networks for three processors; an instruction writing to a remote register injects its result into the switch, which delivers the data to a waiting instruction on a remote processor [6]. The MIT RAW processor took this strategy further, by using a 4x4 mesh network to interconnect its processor tiles between execution units [7].…”

Section: Related Work (mentioning)

confidence: 99%

Implementation and Evaluation of a Dynamically Routed Processor Operand Network

Gratz

Sankaralingam

Hanson

et al. 2007

First International Symposium on Networks-on-Chip (NOCS'07)

Abstract: Microarchitecturally integrated on-chip networks, or micronets, are candidates to replace busses for processor component interconnect in future processor designs. For micronets, tight coupling between processor microarchitecture and network architecture is one of the keys to improving processor performance. This paper presents the design, implementation and evaluation of the TRIPS operand network (OPN). The TRIPS OPN is a 5x5, dynamically routed, 2D mesh micronet that is integrated into the TRIPS microprocessor core. The TRIPS OPN is used for operand passing, register file I/O, and primary memory system I/O. We discuss in detail the OPN design, including the unique features that arise from its integration with the processor core, such as its connection to the execution unit's wakeup pipeline and its in flight mis-speculated traffic removal. We then evaluate the performance of the network under synthetic and realistic loads. Finally, we assess the processor performance implications of OPN design decisions with respect to the end-to-end latency of OPN packets and the OPN's bandwidth.

“…Unlike the full/empty bits like fine-grain synchronization [21,3,11,2,17,15], which tags the entire memory of the machine by associating additional access state bits with each word in memory, the design of SSB is motivated by the following observation: at any instance only a small fraction of memory locations is actively participating in synchronization [25].…”

Section: SSB: Supporting Efficient Fine-grain Synchronization On Many (mentioning)

confidence: 99%

“…HEP [21], Tera [3], MDP [11], Alewife [18,2], MMachine [17], Cray MTA-2 [1], the MT processor in Eldorado [15], and others associate additional access state bits (e.g., full/empty bits) with each word in entire memory. Fine-grain synchronization is achieved by accessing those word-level state bits in memory.…”

Section: Related Work (mentioning)

confidence: 99%

On the Role of Deterministic Fine-Grain Data Synchronization for Scientific Applications: A Revisit in the Emerging Many-Core Era

Zhu

Gao

2007

2007 IEEE International Parallel and Distributed Processing Symposium
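
The two citation statements above describe full/empty bits: an extra state bit attached to each memory word, which readers and writers consult to synchronize. As a rough software illustration only (not the hardware mechanism of any of the cited machines), the sketch below emulates a single tagged word with a lock and condition variable. The names `FullEmptyWord`, `write_when_empty`, `read_when_full`, and `produce_consume` are hypothetical and chosen for this example.

```python
import threading


class FullEmptyWord:
    """One memory word tagged with a full/empty state bit (illustrative only)."""

    def __init__(self):
        self._value = None
        self._full = False  # the per-word state bit; starts empty
        self._cond = threading.Condition()

    def write_when_empty(self, value):
        """Block until the word is empty, store the value, then mark it full."""
        with self._cond:
            self._cond.wait_for(lambda: not self._full)
            self._value = value
            self._full = True
            self._cond.notify_all()

    def read_when_full(self):
        """Block until the word is full, take the value, then mark it empty."""
        with self._cond:
            self._cond.wait_for(lambda: self._full)
            value = self._value
            self._full = False
            self._cond.notify_all()
            return value


def produce_consume(items):
    """Pass items from a producer thread to the caller through a single word."""
    word = FullEmptyWord()
    producer = threading.Thread(
        target=lambda: [word.write_when_empty(x) for x in items]
    )
    producer.start()
    received = [word.read_when_full() for _ in items]
    producer.join()
    return received
```

Because each write waits for the word to be empty and each read waits for it to be full, the producer and consumer alternate strictly, so values arrive in the order they were written without any separate lock or flag variable in the calling code.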

“…We applied their model to four specifically synthesized blocks: three units from the M-Machine, a fine-grained multicomputer designed at MIT and Stanford [14], and the global placement of Magic, a synthesized controller chip from Stanford's Flash multiprocessor [15] (minus the artificially long hand-routed MiscBus). From Figures 4 and 5 (we only show one M-Machine plot for brevity), we see that the model is a good fit for the wire length distributions of these designs, which span a wide range of gate count.…”

Section: Wire Length Distributions (mentioning)

confidence: 99%

Interconnect Scaling Implications for CAD

Ho

Mai

Kapadia

et al. 1999

Interconnect scaling to deep submicron processes presents many challenges to today's CAD flows. A recent analysis by Sylvester and Keutzer examined the behavior of average length wires under scaling, and controversially concluded that current CAD tools are adequate for future module-level designs. In our work, we show that average length wire scaling is sensitive to the technology assumptions, although the change in their behavior is small under all reasonable scaling assumptions. However, examining only average length wires is optimistic, since long wires are the ones that primarily cause CAD tool exceptions. In a module of fixed complexity, under both optimistic and pessimistic scaling assumptions, the number of long wires will increase slowly with scaling. More importantly, as the overall die capacity grows exponentially, the number of modules and thus the total number of wires in a design, will also increase exponentially. Thus, if the design team size and per-designer workload is to remain relatively constant, future CAD tools will need to handle long wires much better than current tools to reduce the percentage of wires that require designer intervention.

Introduction

With process technologies capable of fabricating a billion transistor chip on the horizon, the CAD community faces many new challenges. Notably, interconnect scaling in deep submicron processes may force a fundamental change in current ASIC design methodologies. If interconnect delay becomes a significant fraction of total delay, timing convergence for standard-cell design blocks will become difficult or impossible to achieve with current, crude, fanout-based wire load models. Designers will need new tools and methods to synthesize large blocks in deep submicron technologies.

In a 1998 ICCAD tutorial, Sylvester and Keutzer carried out a detailed analysis of interconnect scaling and its potential effects on CAD methodologies [1] [2]. By examining the scaling of average length wires, they concluded that CAD tools are adequate for future module-level designs. We examine the sensitivity of their analysis to a range of possible technology scaling assumptions by modeling their simulations with an RC tree delay model.

Since design speeds and timing convergence in synthesis flows are typically constrained by long wires, not average-length ones, we extend the analysis to long wires. We show that for a fixed complexity design, the number of long wires grows slowly with scaling. If chip complexity remained constant, this increase in long wires could be handled by small improvements in today's CAD design flow. Unfortunately exponentially increasing die capacity exacerbates the increasing number of long wires per module by driving up the number of modules. Thus, with constant design team size, the number of gates per designer will grow exponentially. To prevent the per-designer workload from also growing exponentially the percentage of wires that need manual intervention must fall exponentially. This implies that future tools must handle a greater per...
