Luca Bertaccini scite author profile

In the Internet-of-Things (IoT) domain, microcontrollers (MCUs) are used to collect and process data coming from sensors and transmit it to the cloud. Applications that require the range and precision of floating-point (FP) arithmetic can be implemented using efficient hardware floating-point units (FPUs) or by using software emulation. FPUs optimize performance and code size, whilst software emulation minimizes the hardware cost. We present a new area-optimized, IEEE 754-compliant RISC-V FPU (Tiny-FPU), and we explore the area, code size, performance, power, and energy efficiency of three different implementations of the RISC-V Instruction Set Architecture double- and single-precision FP extensions on an MCU-class processor. We show that Tiny-FPU, in its double- and single-precision versions, is respectively 54% and 37% smaller than a double- and single-precision FPU optimized for performance and energy efficiency. When coupling a RISC-V core with Tiny-FPU, we achieve up to 18.5× and 15.5× speedups with respect to the same core emulating FP operations via software.


RedMulE: A Compact FP16 Matrix-Multiplication Accelerator for Adaptive Deep Learning on RISC-V-Based Ultra-Low-Power SoCs

Tortorella, Bertaccini, Rossi, et al. 2022


The fast proliferation of extreme-edge applications using Deep Learning (DL) based algorithms requires dedicated hardware to satisfy these applications' latency, throughput, and precision requirements. While inference is achievable in practical cases, online finetuning and adaptation of general DL models are still highly challenging. One of the key stumbling blocks is the need for parallel floating-point operations, which are considered unaffordable on sub-100 mW extreme-edge SoCs. We tackle this problem with RedMulE (Reduced-precision matrix Multiplication Engine), a parametric low-power hardware accelerator for FP16 matrix multiplications (the main kernel of DL training and inference), conceived for tight integration within a cluster of tiny RISC-V cores based on the PULP (Parallel Ultra-Low-Power) architecture. In 22 nm technology, a 32-FMA RedMulE instance occupies just 0.07 mm² (14% of an 8-core RISC-V cluster) and achieves up to 666 MHz maximum operating frequency, for a throughput of 31.6 MAC/cycle (98.8% utilization). We reach a cluster-level power consumption of 43.5 mW and a full-cluster energy efficiency of 688 16-bit GFLOPS/W. Overall, RedMulE features up to 4.65× higher energy efficiency and 22× speedup over SW execution on 8 RISC-V cores.
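
The utilization figure quoted in the abstract can be checked with a few lines of arithmetic. The sketch below is not taken from the paper: it assumes each FMA unit retires at most one MAC per cycle, and that the reported power and energy efficiency refer to the same operating point. All variable names are illustrative.

```python
# Figures quoted in the RedMulE abstract above.
FMA_UNITS = 32                    # FMA units in the evaluated instance
MAC_PER_CYCLE = 31.6              # reported sustained throughput
POWER_W = 43.5e-3                 # reported cluster-level power (43.5 mW)
EFFICIENCY_GFLOPS_PER_W = 688.0   # reported full-cluster energy efficiency

# Assumption: one MAC per FMA unit per cycle is the peak rate,
# so utilization is the ratio of sustained to peak MACs per cycle.
utilization = MAC_PER_CYCLE / FMA_UNITS

# Assumption: efficiency and power were measured at the same operating
# point, so their product is the throughput at that point.
implied_gflops = EFFICIENCY_GFLOPS_PER_W * POWER_W

print(f"utilization: {utilization:.2%}")
print(f"implied throughput: {implied_gflops:.1f} GFLOPS")
```

Under these assumptions the ratio comes out just under 99%, in line with the 98.8% utilization stated in the abstract, and the implied throughput is roughly 30 GFLOPS.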




Luca Bertaccini

MiniFloat-NN and ExSdotp: An ISA Extension and a Modular Open Hardware Unit for Low-Precision Training on RISC-V Cores

Soft Tiles: Capturing Physical Implementation Flexibility for Tightly-Coupled Parallel Processing Clusters

To Buffer, or Not to Buffer? A Case Study on FFT Accelerators for Ultra-Low-Power Multicore Clusters

Tiny-FPU: Low-Cost Floating-Point Support for Small RISC-V MCU Cores

RedMulE: A Compact FP16 Matrix-Multiplication Accelerator for Adaptive Deep Learning on RISC-V-Based Ultra-Low-Power SoCs
