37th International Symposium on Microarchitecture (MICRO-37'04)
DOI: 10.1109/micro.2004.15
Dataflow Mini-Graphs: Amplifying Superscalar Capacity and Bandwidth

Abstract: A mini-graph is a dataflow graph that has an arbitrary internal size and shape but the interface of a singleton instruction: two register inputs, one register output, a maximum of one memory operation, and a maximum of one (terminal) control transfer. Previous work has exploited dataflow sub-graphs whose execution latency can be reduced via programmable FPGA-style hardware. In this paper we show that mini-graphs can improve performance by amplifying the bandwidths of a superscalar processor's stages and the ca…
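The abstract defines a mini-graph by its interface constraints rather than its internal shape. As a minimal sketch of how those constraints could be checked mechanically, assuming a simple list-of-nodes subgraph representation (the `Node`/`Subgraph` types and field names here are illustrative, not from the paper):

```python
from dataclasses import dataclass

@dataclass
class Node:
    """One instruction in a candidate dataflow subgraph (hypothetical model)."""
    op: str                   # e.g. "add", "load", "branch"
    is_memory: bool = False   # load or store
    is_control: bool = False  # branch or jump

@dataclass
class Subgraph:
    nodes: list               # instructions in program order
    register_inputs: int      # live-in registers
    register_outputs: int     # live-out registers

def is_mini_graph(sg: Subgraph) -> bool:
    """Check the singleton-instruction interface from the abstract:
    at most two register inputs, one register output, at most one
    memory operation, and at most one control transfer, which must
    be the terminal node."""
    if sg.register_inputs > 2 or sg.register_outputs > 1:
        return False
    if sum(n.is_memory for n in sg.nodes) > 1:
        return False
    control = [i for i, n in enumerate(sg.nodes) if n.is_control]
    if len(control) > 1:
        return False
    # A control transfer, if present, must be the last (terminal) node.
    if control and control[0] != len(sg.nodes) - 1:
        return False
    return True

# An add/shift chain ending in a branch satisfies the interface:
chain = Subgraph([Node("add"), Node("shl"),
                  Node("branch", is_control=True)], 2, 1)
print(is_mini_graph(chain))  # True
```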

Cited by 54 publications (32 citation statements)
References 32 publications
“…Other recent work [5,23] proposed CCA structures specifically optimized for linear chains of execution. That is to say these structures only execute subgraphs that have two inputs, one output, and a small number of intermediate nodes.…”
Section: Related Work
confidence: 99%
“…Other recent work [5] proposes using the DISE [9] framework to dynamically replace subgraphs in the instruction stream. A special instruction is used to signal the DISE engine, which then inserts the appropriate control logic into the pipeline.…”
Section: Related Work
confidence: 99%
“…Because of this drawback, many researchers have investigated accelerator designs that are more generalized. Some examples of these programmable computation accelerators include 3-1 ALUs [13,20], ALU pipelines [5], closed-loop ALUs [22], and function units [24].…”
Section: Introduction
confidence: 99%
“…Generally speaking, the main compilation challenge in generating code for accelerators is determining which portions of an application to execute on the accelerator and which portions to leave on the standard pipeline. Some researchers have looked into this problem before, proposing greedy algorithms [5,14], exact methods with exponential runtimes [16,17], or exact methods in conjunction with heuristics to avoid degenerate cases [10]. Here, previously proposed compiler algorithms are extended to take into account the reduced interconnect and the data-centric latency of the proposed accelerator design.…”
Section: Introduction
confidence: 99%
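The excerpt above mentions greedy algorithms for deciding which subgraphs to map onto an accelerator. As a hedged illustration of that general idea (not the specific algorithm of any cited paper), one common greedy formulation ranks candidate subgraphs by estimated benefit and accepts each candidate only if it does not overlap instructions already claimed:

```python
def greedy_select(candidates):
    """Illustrative greedy subgraph selection.

    candidates: list of (benefit, instruction_id_set) pairs, where
    benefit is an estimated cycle saving (a modeling assumption here).
    Returns the chosen non-overlapping candidates, best-first.
    """
    chosen, claimed = [], set()
    for benefit, insns in sorted(candidates, key=lambda c: -c[0]):
        if claimed.isdisjoint(insns):   # skip overlapping candidates
            chosen.append((benefit, insns))
            claimed |= insns
    return chosen

# Candidate sharing instruction 2 with the best pick is skipped:
picks = greedy_select([(5, {1, 2}), (3, {2, 3}), (2, {4})])
print(picks)  # [(5, {1, 2}), (2, {4})]
```

Greedy selection like this is fast but can miss the optimum, which is why the excerpt contrasts it with exact exponential-time methods and heuristic hybrids.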
“…Many DSPs have specialized hardware for common computations in signal and image processing, such as dot product, sum of absolute differences, and compare-select. A number of generalized accelerator designs have also been proposed, such as 3-1 ALUs [22,25], closed-loop ALUs [27], or ALU pipelines [5]. Larger accelerators can support bigger subgraphs and thus enhance the performance advantages.…”
Section: Introduction
confidence: 99%