Trace-Based Reconfigurable Acceleration with Data Cache and External Memory Support

Paulino, Nuno; Ferreira, Joao Canas; Cardoso, João M. P.

doi:10.1109/ispa.2014.29

Cited by 5 publications

(4 citation statements)

References 20 publications


“…For earlier row-oriented 2D RPUs, the translation was based on directly implementing the set of CDFGs as a configurable datapath. We started in [7] without support for pipelining, and added loop pipelining support in [9]. Our recent RPU is a 1D architecture and executes modulo scheduled loops.…”

Section: B Mapping Stages (mentioning)

confidence: 99%

“…We have considered different system organizations. The system in [9] (Fig. 3(a)) uses local memory for code, external memory for data, and a custom dual-port cache for the RPU, which can access the full range of the GPP's data.…”

Section: A System Level Architecture (mentioning)

confidence: 99%

“…4 shows two examples of RPUs implementing the dot product Megablock presented in Section II. The 2D RPUs [9] use a single-configuration per Megablock, as we do not yet consider temporal partitioning of Megablocks, and multiple configurations are needed to deal with more than one Megablock. This RPU is obtained by a direct translation of the Megablock CDFGs into (pipelined) datapaths.…”

Section: B RPU Architecture and Generation (mentioning)

confidence: 99%

Transparent Acceleration of Program Execution using Reconfigurable Hardware

Paulino

Ferreira

Bispo

et al. 2015

Design, Automation & Test in Europe Conference & Exhibition (DATE), 2015

Accelerating applications that run on a general-purpose processor (GPP) by mapping parts of their execution to reconfigurable hardware is an approach that does not involve the program's source code and still ensures program portability across different target reconfigurable fabrics. However, the problem is very challenging, as suitable sequences of GPP instructions need to be translated and mapped to hardware, possibly at runtime. Thus, all mapping steps, from compiler analysis and optimizations to hardware generation, need to be both efficient and fast. This paper introduces some of the most representative approaches for binary acceleration using reconfigurable hardware, and presents our binary acceleration approach and the latest results. Our approach extends a GPP with a Reconfigurable Processing Unit (RPU), with both sharing the data memory. Repeating sequences of GPP instructions are migrated to an RPU that is composed of functional units and interconnect resources and is able to exploit instruction-level parallelism, e.g., via loop pipelining. Although we envision a fully dynamic system, currently the RPU resources are selected and organized offline using execution trace information. We present implementation prototypes of the system on a Spartan-6 FPGA with a MicroBlaze as GPP, along with the very encouraging results achieved with a number of benchmarks.
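The abstract mentions that repeating sequences of GPP instructions are identified from execution trace information. As a rough illustration only, the sketch below scans a list of instruction addresses for a back-to-back repeating pattern. It is not the authors' Megablock detection method; the function name `find_repeating_pattern`, its parameters, and the sample addresses are all hypothetical.

```python
from typing import List, Optional, Tuple


def find_repeating_pattern(
    trace: List[int], max_len: int = 32, min_repeats: int = 3
) -> Optional[Tuple[int, int, int]]:
    """Return (start, length, repeats) for the back-to-back repeating
    pattern that covers the most trace entries, or None if no pattern
    repeats at least `min_repeats` times.

    Hypothetical helper for illustration; ties keep the shortest pattern.
    """
    n = len(trace)
    best: Optional[Tuple[int, int, int]] = None
    for length in range(1, min(max_len, n // min_repeats) + 1):
        start = 0
        while start + length * min_repeats <= n:
            pattern = trace[start:start + length]
            repeats = 1
            # Count how many consecutive copies of the pattern follow.
            while (start + (repeats + 1) * length <= n
                   and trace[start + repeats * length:
                             start + (repeats + 1) * length] == pattern):
                repeats += 1
            if repeats >= min_repeats:
                covered = repeats * length
                if best is None or covered > best[1] * best[2]:
                    best = (start, length, repeats)
                start += covered  # skip past the region just matched
            else:
                start += 1
    return best


# Assumed example trace: two setup instructions, a 3-instruction loop
# body executed four times, then one exit instruction.
example_trace = [0x100, 0x104] + [0x200, 0x204, 0x208] * 4 + [0x300]
result = find_repeating_pattern(example_trace)
```

With the assumed trace above, the loop body starting at index 2 is reported as a pattern of length 3 repeated 4 times. A real system would work on basic blocks and branch outcomes rather than raw addresses, so this should be read as a toy model of the idea.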

“…Results obtained with a prototype implementation show that the approach is viable and can be used effectively to handle arbitrary hotspot functions, not just those located in shared library routines. Moreover, as discussed in Section 4, the approach can be extended to handle hotspots that are not necessarily subroutines of the original code (such as the "megablocks" of [16,17]).…”

mentioning

confidence: 99%

Transparent Control Flow Transfer between CPU and Accelerators for HPC

Granhão

Ferreira

2021

Electronics

Heterogeneous platforms with FPGAs have started to be employed in the High-Performance Computing (HPC) field to improve performance and overall efficiency. These platforms allow the use of specialized hardware to accelerate software applications, but require the software to be adapted in what can be a prolonged and complex process. The main goal of this work is to describe and evaluate mechanisms that can transparently transfer the control flow between CPU and FPGA within the scope of HPC. Combining such a mechanism with transparent software profiling and accelerator configuration could lead to an automatic way of accelerating regular applications. In this work, a mechanism based on the ptrace system call is proposed, and its performance on the Intel Xeon+FPGA platform is evaluated. The feasibility of the proposed approach is demonstrated by a working prototype that performs the transparent control flow transfer of any function call to a matching hardware accelerator. This approach is more general than shared library interposition, at the cost of a small time overhead in each accelerator use (about 1.3 ms in the prototype implementation).
