Partitioning and Vectorizing Binary Applications for a Reconfigurable Vector Computer
2014
DOI: 10.1007/978-3-319-05960-0_13
Partitioning and Vectorizing Binary Applications for a Reconfigurable Vector Computer

Cited by 5 publications (4 citation statements)
References 13 publications
“…Yet, programmable hardware such as FPGAs, as a platform for custom-built accelerator designs [Kenter et al. 2012, 2014; Strzodka and Goddeke 2006], can make effective use of all of these, but also entirely custom number formats. Developers can specify the number of exponent and mantissa bits and trade off precision against the amount of memory blocks required to store values and the number of logic elements required to perform arithmetic operations on them.…”
Section: Approximate Computing
confidence: 99%
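The trade-off described above can be sketched numerically. The function below is a hypothetical illustration (not from the cited work): given exponent and mantissa widths for a sign/exponent/mantissa format with an implicit leading 1 and IEEE-style bias, it reports the storage cost in bits, the relative precision, and the largest finite value, so narrower custom formats can be compared against IEEE 754 single precision.

```python
def format_properties(exp_bits: int, man_bits: int):
    """Properties of a hypothetical custom float format:
    1 sign bit + exp_bits exponent bits + man_bits mantissa bits,
    with an implicit leading 1 and an IEEE-style exponent bias."""
    total_bits = 1 + exp_bits + man_bits        # storage per value
    bias = (1 << (exp_bits - 1)) - 1            # IEEE-style bias
    ulp = 2.0 ** -man_bits                      # relative precision
    max_exp = (1 << exp_bits) - 2 - bias        # largest finite exponent
    max_val = (2.0 - ulp) * 2.0 ** max_exp      # largest finite value
    return total_bits, ulp, max_val

# IEEE 754 single precision for reference: 8 exponent, 23 mantissa bits
single = format_properties(8, 23)   # (32 bits, 2**-23, ~3.40e38)
# A narrower custom format saves memory blocks at the cost of precision:
custom = format_properties(6, 11)   # (18 bits, 2**-11, ~4.29e9)
```

On an FPGA, the smaller `total_bits` translates directly into fewer memory blocks per stored value and narrower arithmetic datapaths, which is the trade-off the quoted passage describes.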
“…Convey includes a compiler to target this Vector Personality by annotating source code with pragmas; however, we found it to be limited to simple array data structures and simple loop nesting patterns, which often requires significant code adaptations besides adding the vectorization pragmas. We fixed many of these shortcomings with the toolflow proposed in [26]; however, for the comparison of architectural overheads of the overlay, we wanted to achieve the best possible performance. Therefore, for this work, we designed all kernels by hand in assembly code, exploiting, on top of the capabilities of the automated toolflow, additional opportunities such as vector partitioning, vector register rotation, and enhanced reuse of partially computed addresses.…”
Section: Convey HC-1 Platform With Vector Processor Overlay
confidence: 99%
“…The work on instruction-set extensions is an example of partitioning usually limited to the migration to custom hardware of acyclic short sequences of instructions (see, e.g., [19]). In approaches where the RPU is loosely coupled to the GPP, as a co-processor, it is common to execute larger code sections (such as entire loops) [20], [14], [21], [22]. We briefly describe next the approaches most relevant to our work and present in TABLE I a summary of the reported speedups.…”
Section: Related Work
confidence: 99%
“…The binary is modified in order to add instructions for configuration and communication from/to DySER blocks. Another approach [22] maps loops in LLVM IR to a Vector Personality softcore for the Convey HC-1. At compile-time, a toolchain (including LLVM and Convey Compiler infrastructures) automatically identifies suitable loops (including outer loops) for vectorization.…”
Section: Related Work
confidence: 99%