Experiments and optimizations for TVM on RISC-V Architectures with P Extension

Chen, Yi-Ru; Liao, Hui-Hsin; Chang, Chun‐Hsiang; Lin, Che-Chia; Lee, Chao-Lin; Chang, Yu-Ting; Yang, Chun‐Chieh; Lee, Jenq-Kuen

doi:10.1109/vlsi-dat49148.2020.9196477

Cited by 6 publications

(3 citation statements)

References 8 publications


“…In recent years, many applications have begun to leverage the advantages of RISC-V for optimization and acceleration in lower-power embedded systems. For example, in [32], the utilization of the P extension of RISC-V led to accelerated model execution based on TVM. The acceleration achieved through the P-extension enables faster inference computations.…”

Section: Related Work and Discussion (mentioning)

confidence: 99%

Case Study: Optimization Methods With TVM Hybrid-OP on RISC-V Packed SIMD

Yu, Yuan, Chen et al. 2024

IEEE Access

Self Cite


In recent years, considerable research has focused on the use of custom hardware to accelerate deep learning on edge devices. However, the end-to-end flow of deep learning includes preprocessing and postprocessing. Deep learning hardware accelerators cannot accelerate these operations, which consequently becomes a performance bottleneck in the execution flow. In this study, we propose optimization methods to improve preprocessing and postprocessing at the edge devices. For this purpose, we adopt Tensor Virtual Machine (TVM), an end-to-end machine learning compiler framework. TVM provides hybrid script, which is a frontend language that allows users to write programs for preprocessing and postprocessing. We propose rewriting strategies to improve the performance of operators written in hybrid script through the RISC-V Packed SIMD extension (P extension). RISC-V is an open instruction set architecture (ISA) that provides base instructions and many extensions for different use cases. The P extension defines specific subword single-instruction multiple-data (SIMD) instructions that allow complex computations to be efficiently performed on edge devices. In this study, we design custom instructions based on the RISC-V P extension for rewriting strategies to accelerate deep learning operations. Experimental results indicate that our methods improve performance by a factor of 1.28 to 15.29.
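The subword SIMD idea described in this abstract can be illustrated with a minimal sketch in plain Python. It emulates a lane-wise add of four 8-bit values packed into one 32-bit word, which is the general kind of operation a packed SIMD extension provides in a single instruction. The helper names `pack8`, `unpack8`, and `add8_packed` are hypothetical and chosen for illustration; they are not TVM hybrid script or RISC-V toolchain APIs, and the wrap-around (non-saturating) behavior is an assumption of this sketch.

```python
MASK32 = 0xFFFFFFFF


def pack8(lanes):
    """Pack four 8-bit values into a 32-bit word (lane 0 is the least significant byte)."""
    word = 0
    for i, value in enumerate(lanes):
        word |= (value & 0xFF) << (8 * i)
    return word


def unpack8(word):
    """Split a 32-bit word back into its four 8-bit lanes."""
    return [(word >> (8 * i)) & 0xFF for i in range(4)]


def add8_packed(a, b):
    """Add four 8-bit lanes at once, wrapping modulo 256, with no carry across lanes."""
    # Add the low 7 bits of every lane; any carry lands in bit 7 of the same lane.
    low7 = (a & 0x7F7F7F7F) + (b & 0x7F7F7F7F)
    # Fold the top bit of each lane back in with XOR so it cannot carry out.
    top = (a ^ b) & 0x80808080
    return (low7 ^ top) & MASK32


a = pack8([1, 200, 255, 16])
b = pack8([2, 100, 1, 240])
print(unpack8(add8_packed(a, b)))  # four lane results computed with one word-wide add
```

A scalar loop would need four separate additions for the same data; the packed form handles all lanes in one word-wide operation, which is where the speedup on narrow data types comes from.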


“…Compared with Ragan-Kelley et al [23], which adopts stochastic search, we prefer an exhaustive hardware-in-the-loop approach, capable of finding the best optimization for the problem by direct performance profiling. MCU-oriented autotuning tools, like uTVM 8 and the work of Chen et al [8], are currently limited to inference and currently do not achieve a speed comparable to hand-tuned libraries, like CMSIS-NN [13] and TinyEngine [15]. PULP (parallel ultra-low power) is a computational platform for energy-efficient and scalable edge computing based on RISC-V cores [26].…”

Section: Related Work (mentioning)

confidence: 99%

PULP-TrainLib: Enabling On-Device Training for RISC-V Multi-core MCUs Through Performance-Driven Autotuning

Nadalini, Rusci, Tagliavini et al. 2022

Lecture Notes in Computer Science


An open challenge in making Internet-of-Things sensor nodes "smart" and self-adaptive is to enable on-chip Deep Neural Network (DNN) training on Ultra-Low-Power (ULP) microcontroller units (MCUs). To this aim, we present a framework, based on PULP-TrainLib, to deploy DNN training tasks on RISC-V-based Parallel-ULP (PULP) MCUs. PULP-TrainLib is a library of parallel software DNN primitives enabling the execution of forward and backward steps on PULP MCUs. To optimize PULP-TrainLib's kernels, we propose a strategy to automatically select and configure (autotune) the fastest among a set of tiling options and optimized floating-point matrix multiplication kernels, according to the tensor shapes of every DNN layer. Results on an 8-core RISC-V MCU show that our auto-tuned primitives improve MAC/clk by up to 2.4× compared to "one-size-fits-all" matrix multiplication, achieving up to 4.39 MAC/clk, 36.6× better than a commercial STM32L4 MCU executing the same DNN layer training workload. Furthermore, our strategy proves to be 30.7× faster than AIfES, a state-of-the-art training library for MCUs, while training a complete TinyML model.
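The autotuning strategy summarized in this abstract (profile a set of candidate kernels for a given tensor shape and keep the fastest) can be sketched roughly as follows. This is a host-side illustration in plain Python under stated assumptions, not PULP-TrainLib's implementation: the functions `matmul_tiled` and `autotune`, the candidate tile sizes, and the use of wall-clock timing are all hypothetical choices made for this example.

```python
import time


def matmul_tiled(a, b, tile):
    """Multiply a (n x k) by b (k x m), walking the shared k dimension in blocks of `tile`."""
    n, k, m = len(a), len(b), len(b[0])
    c = [[0.0] * m for _ in range(n)]
    for k0 in range(0, k, tile):
        for i in range(n):
            row_a, row_c = a[i], c[i]
            for kk in range(k0, min(k0 + tile, k)):
                a_ik, row_b = row_a[kk], b[kk]
                for j in range(m):
                    row_c[j] += a_ik * row_b[j]
    return c


def autotune(a, b, tiles=(1, 2, 4, 8), repeats=3):
    """Time every candidate tile size on these exact shapes and return the fastest one."""
    best_tile, best_time = None, float("inf")
    for tile in tiles:
        start = time.perf_counter()
        for _ in range(repeats):
            matmul_tiled(a, b, tile)
        elapsed = (time.perf_counter() - start) / repeats
        if elapsed < best_time:
            best_tile, best_time = tile, elapsed
    return best_tile


# Hypothetical layer shape: a 16x32 matrix times a 32x8 matrix.
a = [[float(i + j) for j in range(32)] for i in range(16)]
b = [[float(i - j) for j in range(8)] for i in range(32)]
print("fastest tile for this shape:", autotune(a, b))
```

The selected tile depends on the machine and on the shape, which is the point of tuning per layer rather than using one fixed configuration.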


“…[42] added support for the RISC-V packed SIMD P draft extension [41] to the RISC-V 64-bit CVA6 processor [43]. [44] presented an end-to-end compiler to optimize the code generation behavior of quantized neural networks that leverages the RISC-V P extension [41] to optimize quantized neural network applications.…”

Section: State Of the Art (mentioning)

confidence: 99%