The breakdown of Dennard scaling has resulted in a decade-long stall of the maximum operating clock frequencies of processors. To mitigate this issue, computing shifted to multi-core devices, which introduced the need for programming flows and tools that facilitate the expression of workload parallelism at high abstraction levels. However, not all workloads are easily parallelizable, and the incremental improvements to processor cores have not significantly increased single-threaded performance. At the same time, the Instruction Level Parallelism present in applications remains considerably underexploited. This article reviews notable approaches that exploit this potential parallelism via automatic generation of specialized hardware from binary code. Although research on this topic spans more than 20 years, automatic acceleration of software via translation to hardware has gained new importance with the recent trend toward reconfigurable heterogeneous platforms. We characterize this kind of binary acceleration approach and the accelerator architectures on which it relies. We summarize notable state-of-the-art approaches individually and present a taxonomy and comparison. Performance gains from 2.6× to 5.6× are reported, mostly for bare-metal embedded applications, along with power consumption reductions between 1.3× and 3.9×. We believe the methodologies and results achievable by automatic hardware generation approaches are promising in the context of emerging reconfigurable devices.