Convolutional neural networks have rapidly become one of the most successful machine-learning approaches, enabling ubiquitous machine vision and intelligent decision making even on embedded computing systems. While the underlying arithmetic is structurally simple, the compute and memory requirements are challenging. One promising opportunity is leveraging reduced-precision representations for inputs, activations, and model parameters. The resulting scalability in performance, power efficiency, and storage footprint offers attractive design trade-offs in exchange for a small reduction in accuracy. FPGAs are ideal for deploying such low-precision inference engines, since custom precisions can be tailored to the numerical accuracy required by a given application. In this article, we describe the second generation of the FINN framework, an end-to-end tool that enables design-space exploration and automates the creation of fully customized inference engines on FPGAs. Given a neural network description, the tool optimizes for a given platform, design target, and specific precision. We introduce formalizations of resource cost functions and performance predictions, and elaborate on the optimization algorithms. Finally, we evaluate a selection of reduced-precision neural networks, ranging from CIFAR-10 classifiers to YOLO-based object detection, on a range of platforms including PYNQ and AWS F1, demonstrating unprecedented measured throughput of 50 TOp/s on AWS F1 and 5 TOp/s on embedded devices.
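As a software illustration of the arithmetic such engines exploit, consider the extreme case of binarized networks: with weights and activations constrained to {-1, +1} and packed as bits, a dot product collapses into an XNOR followed by a popcount. The Java sketch below is purely illustrative and is not FINN-generated code; the packing convention (bit value 1 meaning +1) and the 64-element vector width are assumptions.

    // Illustrative only: binarized dot product via XNOR + popcount.
    // Assumed convention: bit = 1 encodes +1, bit = 0 encodes -1.
    public final class BinarizedDot {

        // Dot product of two 64-element binarized vectors packed into longs.
        static int dot64(long activations, long weights) {
            long agree = ~(activations ^ weights); // XNOR: 1 where signs agree
            int matches = Long.bitCount(agree);    // popcount of agreements
            return 2 * matches - 64;               // back to a signed {-1,+1} sum
        }

        public static void main(String[] args) {
            long a = 0xF0F0F0F0F0F0F0F0L;
            long w = 0xFF00FF00FF00FF00L;
            System.out.println(dot64(a, w));       // 0: half agree, half disagree
        }
    }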
The requirements placed on embedded systems increase steadily. In parallel, the performance of the processors used in these systems improves, leading to multithreaded and/or multicore architectures. Depending on the type of embedded system, Java is an increasingly popular choice for software development. In this paper, we present a Java benchmark suite that enables the comparison of different embedded Java platforms while assuming only the availability of a CLDC API, the minimal configuration defined for the J2ME. The core of the benchmark suite consists of adapted real-world applications. Furthermore, the suite contains benchmarks to explore multicore/multithreaded systems. Hence, it is possible to determine the gain of a parallel execution platform compared to sequential execution; additionally, the penalty of a sequential program running on a parallel platform can be measured. Our benchmarks are structured into micro, kernel, application, parallel, and streaming benchmarks.
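The following sketch shows, under stated assumptions, how such a speedup measurement can be expressed using nothing beyond CLDC-level APIs. It assumes CLDC 1.1, which provides Thread.join(); the class name and the workload are hypothetical stand-ins, not part of the suite's actual API.

    // Hedged sketch: parallel-vs-sequential speedup with CLDC 1.1 APIs only
    // (Thread, System.currentTimeMillis). All names are illustrative.
    public final class SpeedupBench {

        static final int N = 4;           // number of worker threads
        static final int WORK = 1000000;  // iterations per worker

        // Trivial integer workload standing in for an application kernel.
        static int kernel(int iterations) {
            int acc = 1;
            for (int i = 0; i < iterations; i++) {
                acc = acc * 31 + i;
            }
            return acc;
        }

        public static void main(String[] args) throws InterruptedException {
            // Sequential baseline: all work on one thread.
            long t0 = System.currentTimeMillis();
            for (int i = 0; i < N; i++) {
                kernel(WORK);
            }
            long sequential = System.currentTimeMillis() - t0;

            // Parallel run: the same total work split across N threads.
            Thread[] workers = new Thread[N];
            t0 = System.currentTimeMillis();
            for (int i = 0; i < N; i++) {
                workers[i] = new Thread() {
                    public void run() { kernel(WORK); }
                };
                workers[i].start();
            }
            for (int i = 0; i < N; i++) {
                workers[i].join();        // join() requires CLDC 1.1
            }
            long parallel = System.currentTimeMillis() - t0;

            System.out.println("speedup = " + ((double) sequential / parallel));
        }
    }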
Caching of complete methods has been suggested to simplify the determination of the worst-case execution time (WCET) in the presence of a memory hierarchy [9]. While this previous approach limits possible cache misses to method invocations and returns, it still assumes a conventional blocked organization of the cache memory. This paper proposes and evaluates a new approach that organizes the cached methods in a linked list and limits tag matching to a sliding window of at most three methods over this list. The main advantages of this approach are the avoidance of the low block utilization caused by small methods, achieved through bump-pointer space allocation, and a further simplification of the WCET analysis through easy miss prediction based solely on locally available call-stack information.
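A minimal software model of the proposed lookup might look as follows; it rests on assumptions of our own (integer tags, eviction from the list tail, and a capacity counted in methods rather than bytes are illustrative simplifications, not the paper's hardware design). The cached methods form a linked list, and a lookup compares the requested tag against at most three entries per step as the window slides along the list.

    import java.util.Iterator;
    import java.util.LinkedList;

    // Illustrative model only; the real design is a hardware structure.
    public final class SlidingWindowMethodCache {

        private static final int WINDOW = 3;  // tag comparators per lookup step
        private final LinkedList<Integer> tags = new LinkedList<Integer>();
        private final int capacity;           // max cached methods (simplified)

        SlidingWindowMethodCache(int capacity) {
            this.capacity = capacity;
        }

        // Returns the number of window steps until a hit, or -1 on a miss,
        // in which case the method is loaded at the head of the list
        // (mimicking bump-pointer allocation of fresh, contiguous space).
        int access(int methodTag) {
            int position = 0;
            for (Iterator<Integer> it = tags.iterator(); it.hasNext(); position++) {
                if (it.next().intValue() == methodTag) {
                    return position / WINDOW + 1;  // windows examined so far
                }
            }
            if (tags.size() == capacity) {
                tags.removeLast();                 // evict from the tail
            }
            tags.addFirst(Integer.valueOf(methodTag));
            return -1;
        }
    }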
Integer addition is a pervasive operation in FPGA designs. The need for fast wide adders grows with the demand for large precisions, as required, for example, by the implementation of IEEE-754 quadruple precision and elliptic-curve cryptography. The FPGA realization of fast and compact binary adders relies on hardware carry chains, which provide a natural implementation environment for the ripple-carry addition (RCA) scheme. As its latency grows linearly with the operand width, wide additions call for acceleration, which is reasonably achieved by addition schemes built from parallel RCA blocks. This study presents FPGA-specific arithmetic optimizations for the mapping of carry-select/increment adders onto the hardware carry chains of modern FPGAs. Different trade-offs between latency and area are presented. The proposed architectures represent attractive alternatives to deeply pipelined RCA schemes.
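As a behavioral sketch of the carry-select principle (block size, operand width, and all names are illustrative assumptions, not the proposed FPGA architecture), each block computes its sum for both possible carry-in values before the actual carry arrives; the incoming carry then merely selects between the two precomputed results. In hardware the blocks evaluate in parallel, so the critical path is dominated by carry selection rather than a full-width ripple.

    // Behavioral sketch of a carry-select adder; parameters are illustrative.
    public final class CarrySelectAdder {

        static final int WIDTH = 64;  // total operand width in bits
        static final int BLOCK = 16;  // width of each RCA block

        static long add(long a, long b) {
            long mask = (1L << BLOCK) - 1;
            long sum = 0;
            int carry = 0;
            for (int lo = 0; lo < WIDTH; lo += BLOCK) {
                long ab = (a >>> lo) & mask;
                long bb = (b >>> lo) & mask;
                // Both candidate results exist before the carry arrives.
                long sum0 = ab + bb;       // assuming carry-in = 0
                long sum1 = ab + bb + 1;   // assuming carry-in = 1
                long chosen = (carry == 0) ? sum0 : sum1;
                sum |= (chosen & mask) << lo;
                carry = (int) ((chosen >>> BLOCK) & 1L);  // carry-out of block
            }
            return sum;
        }

        public static void main(String[] args) {
            long a = 0x0123456789ABCDEFL;
            long b = 0xFEDCBA9876543210L;
            System.out.println(add(a, b) == a + b);  // true: agrees with plain +
        }
    }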