Convolutional neural networks have rapidly become one of the most successful machine-learning approaches, enabling ubiquitous machine vision and intelligent decision making even on embedded computing systems. While the underlying arithmetic is structurally simple, the compute and memory requirements are challenging. One promising opportunity is leveraging reduced-precision representations for inputs, activations, and model parameters. The resulting scalability in performance, power efficiency, and storage footprint offers attractive design trade-offs in exchange for a small reduction in accuracy. FPGAs are ideal for deploying such low-precision inference engines, since custom precisions can be tailored to the numerical accuracy required by a given application. In this article, we describe the second generation of the FINN framework, an end-to-end tool that enables design-space exploration and automates the creation of fully customized inference engines on FPGAs. Given a neural network description, the tool optimizes for a given platform, design target, and specific precision. We introduce formalizations of resource cost functions and performance predictions, and elaborate on the optimization algorithms. Finally, we evaluate a selection of reduced-precision neural networks, ranging from CIFAR-10 classifiers to YOLO-based object detection, on a range of platforms including PYNQ and AWS F1, demonstrating unprecedented measured throughput of 50 TOp/s on AWS F1 and 5 TOp/s on embedded devices.
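As a software illustration of the arithmetic such engines exploit, consider the extreme case of binarized networks: with weights and activations constrained to {-1, +1} and packed as bits, a dot product collapses into an XNOR followed by a popcount. The Java sketch below is purely illustrative and is not FINN-generated code; the packing convention (bit value 1 meaning +1) and the 64-element vector width are assumptions.

    // Illustrative only: binarized dot product via XNOR + popcount.
    // Assumed convention: bit = 1 encodes +1, bit = 0 encodes -1.
    public final class BinarizedDot {

        // Dot product of two 64-element binarized vectors packed into longs.
        static int dot64(long activations, long weights) {
            long agree = ~(activations ^ weights); // XNOR: 1 where signs agree
            int matches = Long.bitCount(agree);    // popcount of agreements
            return 2 * matches - 64;               // back to a signed {-1,+1} sum
        }

        public static void main(String[] args) {
            long a = 0xF0F0F0F0F0F0F0F0L;
            long w = 0xFF00FF00FF00FF00L;
            System.out.println(dot64(a, w));       // 0: half agree, half disagree
        }
    }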
The requirements placed on embedded systems increase steadily. In parallel, the performance of the processors used in these systems improves, leading to multithreaded and/or multicore architectures. Depending on the type of embedded system, Java is an increasingly popular choice for software development. In this paper, we present a Java benchmark suite that enables the comparison of different embedded Java platforms while assuming only the availability of a CLDC API, the minimal configuration defined for the J2ME. The core of the benchmark suite consists of adapted real-world applications. Furthermore, the suite contains benchmarks to explore multicore/multithreaded systems. Hence, it is possible to determine the gain of a parallel execution platform compared to sequential execution; additionally, the penalty of a sequential program running on a parallel platform can be measured. Our benchmarks are structured into micro, kernel, application, parallel, and streaming benchmarks.
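The following sketch shows, under stated assumptions, how such a speedup measurement can be expressed using nothing beyond CLDC-level APIs. It assumes CLDC 1.1, which provides Thread.join(); the class name and the workload are hypothetical stand-ins, not part of the suite's actual API.

    // Hedged sketch: parallel-vs-sequential speedup with CLDC 1.1 APIs only
    // (Thread, System.currentTimeMillis). All names are illustrative.
    public final class SpeedupBench {

        static final int N = 4;           // number of worker threads
        static final int WORK = 1000000;  // iterations per worker

        // Trivial integer workload standing in for an application kernel.
        static int kernel(int iterations) {
            int acc = 1;
            for (int i = 0; i < iterations; i++) {
                acc = acc * 31 + i;
            }
            return acc;
        }

        public static void main(String[] args) throws InterruptedException {
            // Sequential baseline: all work on one thread.
            long t0 = System.currentTimeMillis();
            for (int i = 0; i < N; i++) {
                kernel(WORK);
            }
            long sequential = System.currentTimeMillis() - t0;

            // Parallel run: the same total work split across N threads.
            Thread[] workers = new Thread[N];
            t0 = System.currentTimeMillis();
            for (int i = 0; i < N; i++) {
                workers[i] = new Thread() {
                    public void run() { kernel(WORK); }
                };
                workers[i].start();
            }
            for (int i = 0; i < N; i++) {
                workers[i].join();        // join() requires CLDC 1.1
            }
            long parallel = System.currentTimeMillis() - t0;

            System.out.println("speedup = " + ((double) sequential / parallel));
        }
    }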
Caching of complete methods has been suggested to simplify the determination of the worst-case execution time (WCET) in the presence of a memory hierarchy [9]. While this previous approach limits possible cache misses to method invocations and returns, it still assumes a conventional blocked organization of the cache memory. This paper proposes and evaluates a new approach that organizes the cached methods in a linked list and limits tag matching to a sliding window of at most three methods over this list. The main advantages of this approach are the avoidance of the low block utilization caused by small methods, achieved through bump-pointer space allocation, and a further simplification of the WCET analysis through easy miss prediction based solely on locally available call-stack information.
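A minimal software model of the proposed lookup might look as follows; it rests on assumptions of our own (integer tags, eviction from the list tail, and a capacity counted in methods rather than bytes are illustrative simplifications, not the paper's hardware design). The cached methods form a linked list, and a lookup compares the requested tag against at most three entries per step as the window slides along the list.

    import java.util.Iterator;
    import java.util.LinkedList;

    // Illustrative model only; the real design is a hardware structure.
    public final class SlidingWindowMethodCache {

        private static final int WINDOW = 3;  // tag comparators per lookup step
        private final LinkedList<Integer> tags = new LinkedList<Integer>();
        private final int capacity;           // max cached methods (simplified)

        SlidingWindowMethodCache(int capacity) {
            this.capacity = capacity;
        }

        // Returns the number of window steps until a hit, or -1 on a miss,
        // in which case the method is loaded at the head of the list
        // (mimicking bump-pointer allocation of fresh, contiguous space).
        int access(int methodTag) {
            int position = 0;
            for (Iterator<Integer> it = tags.iterator(); it.hasNext(); position++) {
                if (it.next().intValue() == methodTag) {
                    return position / WINDOW + 1;  // windows examined so far
                }
            }
            if (tags.size() == capacity) {
                tags.removeLast();                 // evict from the tail
            }
            tags.addFirst(Integer.valueOf(methodTag));
            return -1;
        }
    }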
Integer addition is a pervasive operation in FPGA designs. The need for fast wide adders grows with the demand for large precisions, as required, for example, by the implementation of IEEE-754 quadruple precision and elliptic-curve cryptography. The FPGA realization of fast and compact binary adders relies on hardware carry chains, which provide a natural implementation environment for the ripple-carry addition (RCA) scheme. As its latency grows linearly with the operand width, wide additions call for acceleration, which is reasonably achieved by addition schemes built from parallel RCA blocks. This study presents FPGA-specific arithmetic optimizations for the mapping of carry-select/increment adders onto the hardware carry chains of modern FPGAs. Different trade-offs between latency and area are presented. The proposed architectures represent attractive alternatives to deeply pipelined RCA schemes.
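As a behavioral sketch of the carry-select principle (block size, operand width, and all names are illustrative assumptions, not the proposed FPGA architecture), each block computes its sum for both possible carry-in values before the actual carry arrives; the incoming carry then merely selects between the two precomputed results. In hardware the blocks evaluate in parallel, so the critical path is dominated by carry selection rather than a full-width ripple.

    // Behavioral sketch of a carry-select adder; parameters are illustrative.
    public final class CarrySelectAdder {

        static final int WIDTH = 64;  // total operand width in bits
        static final int BLOCK = 16;  // width of each RCA block

        static long add(long a, long b) {
            long mask = (1L << BLOCK) - 1;
            long sum = 0;
            int carry = 0;
            for (int lo = 0; lo < WIDTH; lo += BLOCK) {
                long ab = (a >>> lo) & mask;
                long bb = (b >>> lo) & mask;
                // Both candidate results exist before the carry arrives.
                long sum0 = ab + bb;       // assuming carry-in = 0
                long sum1 = ab + bb + 1;   // assuming carry-in = 1
                long chosen = (carry == 0) ? sum0 : sum1;
                sum |= (chosen & mask) << lo;
                carry = (int) ((chosen >>> BLOCK) & 1L);  // carry-out of block
            }
            return sum;
        }

        public static void main(String[] args) {
            long a = 0x0123456789ABCDEFL;
            long b = 0xFEDCBA9876543210L;
            System.out.println(add(a, b) == a + b);  // true: agrees with plain +
        }
    }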