37th International Symposium on Microarchitecture (MICRO-37'04)
DOI: 10.1109/micro.2004.15
Dataflow Mini-Graphs: Amplifying Superscalar Capacity and Bandwidth

Abstract: A mini-graph is a dataflow graph that has an arbitrary internal size and shape but the interface of a singleton instruction: two register inputs, one register output, a maximum of one memory operation, and a maximum of one (terminal) control transfer. Previous work has exploited dataflow sub-graphs whose execution latency can be reduced via programmable FPGA-style hardware. In this paper we show that mini-graphs can improve performance by amplifying the bandwidths of a superscalar processor's stages and the ca…
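The abstract defines a mini-graph by its interface constraints rather than its internal shape. As a minimal sketch of how those constraints could be checked mechanically, assuming a simple list-of-nodes subgraph representation (the `Node`/`Subgraph` types and field names here are illustrative, not from the paper):

```python
from dataclasses import dataclass

@dataclass
class Node:
    """One instruction in a candidate dataflow subgraph (hypothetical model)."""
    op: str                   # e.g. "add", "load", "branch"
    is_memory: bool = False   # load or store
    is_control: bool = False  # branch or jump

@dataclass
class Subgraph:
    nodes: list               # instructions in program order
    register_inputs: int      # live-in registers
    register_outputs: int     # live-out registers

def is_mini_graph(sg: Subgraph) -> bool:
    """Check the singleton-instruction interface from the abstract:
    at most two register inputs, one register output, at most one
    memory operation, and at most one control transfer, which must
    be the terminal node."""
    if sg.register_inputs > 2 or sg.register_outputs > 1:
        return False
    if sum(n.is_memory for n in sg.nodes) > 1:
        return False
    control = [i for i, n in enumerate(sg.nodes) if n.is_control]
    if len(control) > 1:
        return False
    # A control transfer, if present, must be the last (terminal) node.
    if control and control[0] != len(sg.nodes) - 1:
        return False
    return True

# An add/shift chain ending in a branch satisfies the interface:
chain = Subgraph([Node("add"), Node("shl"),
                  Node("branch", is_control=True)], 2, 1)
print(is_mini_graph(chain))  # True
```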

Cited by 54 publications (32 citation statements)
References 32 publications
“…Other recent work [5,23] proposed CCA structures specifically optimized for linear chains of execution. That is to say these structures only execute subgraphs that have two inputs, one output, and a small number of intermediate nodes.…”
Section: Related Work
confidence: 99%
“…Other recent work [5] proposes using the DISE [9] framework to dynamically replace subgraphs in the instruction stream. A special instruction is used to signal the DISE engine, which then inserts the appropriate control logic into the pipeline.…”
Section: Related Work
confidence: 99%
“…Because of this drawback, many researchers have investigated accelerator designs that are more generalized. Some examples of these programmable computation accelerators include 3-1 ALUs [13,20], ALU pipelines [5], closed-loop ALUs [22], and function units [24].…”
Section: Introduction
confidence: 99%
“…Generally speaking, the main compilation challenge in generating code for accelerators is determining which portions of an application to execute on the accelerator and which portions to leave on the standard pipeline. Some researchers have looked into this problem before, proposing greedy algorithms [5,14], exact methods with exponential runtimes [16,17], or exact methods in conjunction with heuristics to avoid degenerate cases [10]. Here, previously proposed compiler algorithms are extended to take into account the reduced interconnect and the data-centric latency of the proposed accelerator design.…”
Section: Introduction
confidence: 99%
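The excerpt above mentions greedy algorithms for deciding which subgraphs to map onto an accelerator. As a hedged illustration of that general idea (not the specific algorithm of any cited paper), one common greedy formulation ranks candidate subgraphs by estimated benefit and accepts each candidate only if it does not overlap instructions already claimed:

```python
def greedy_select(candidates):
    """Illustrative greedy subgraph selection.

    candidates: list of (benefit, instruction_id_set) pairs, where
    benefit is an estimated cycle saving (a modeling assumption here).
    Returns the chosen non-overlapping candidates, best-first.
    """
    chosen, claimed = [], set()
    for benefit, insns in sorted(candidates, key=lambda c: -c[0]):
        if claimed.isdisjoint(insns):   # skip overlapping candidates
            chosen.append((benefit, insns))
            claimed |= insns
    return chosen

# Candidate sharing instruction 2 with the best pick is skipped:
picks = greedy_select([(5, {1, 2}), (3, {2, 3}), (2, {4})])
print(picks)  # [(5, {1, 2}), (2, {4})]
```

Greedy selection like this is fast but can miss the optimum, which is why the excerpt contrasts it with exact exponential-time methods and heuristic hybrids.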
“…Many DSPs have specialized hardware for common computations in signal and image processing, such as dot product, sum of absolute differences, and compare-select. A number of generalized accelerator designs have also been proposed, such as 3-1 ALUs [22,25], closed-loop ALUs [27], or ALU pipelines [5]. Larger accelerators can support bigger subgraphs and thus enhance the performance advantages.…”
Section: Introduction
confidence: 99%