Optimizing Bit-Serial Matrix Multiplication for Reconfigurable Computing

Umuroglu, Yaman; Conficconi, Davide; Rasnayake, Lahiru; Preußer, Thomas B.; Själander, Magnus

doi:10.1145/3337929

Cited by 16 publications

(10 citation statements)

References 20 publications

“…Indeed, they tremendously ease the generation of highly customizable modules and enable the birth of System on Chip generators [4], [5]. However, architecture modeling is often case-specific [6], and the support for automatic DSE is still missing. Nowadays, Computer-Aided Design (CAD) tools are vital to increase productivity, ensure correctness and performance of intricate designs, and enable a higher level of complexity.…”

Section: Introduction (mentioning)

confidence: 99%

Dovado: An Open-Source Design Space Exploration Framework

Paletti

Conficconi

Santambrogio

2021

2021 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)

Self Cite

Traditional hardware development exploits description languages such as VHDL and (System)Verilog to produce highly parametrizable RTL designs. Different parameter values yield different utilization-frequency trade-offs, and hand-tuning is not feasible with a non-trivial number of parameters. Generally, the Computer-Aided Design (CAD) literature proposes approaches that mainly tackle automatic exploration without combining a design automation feature. Hence, this work proposes Dovado, an open-source CAD tool for design space exploration (DSE) tailored for FPGA-based designs. Starting from VHDL/(System)Verilog, Dovado exploits Vivado and supports the hardware developer in either an exact exploration of a given set of parameters or a DSE that returns the nondominated set of configuration points. In this work, we exploit a multi-objective integer formulation and Non-Dominated Sorting Genetic Algorithm (NSGA)-II for a fast DSE. Moreover, we propose an approximation model for the NSGA-II fitness function to decide whether Vivado or a Nadaraya-Watson model should estimate the optimization metrics.
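The "nondominated set of configuration points" mentioned in the abstract can be illustrated with a small Pareto filter. This is only a minimal sketch, not Dovado's implementation: the function names `dominates` and `nondominated` and the sample design points are hypothetical, and every objective is assumed to be minimized (clock frequency is negated so that higher is better).

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]

def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """True if a is no worse than b in every objective and strictly better in at least one."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def nondominated(points: List[Point]) -> List[Point]:
    """Return the points that no other point dominates, preserving input order."""
    return [p for p in points if not any(dominates(q, p) for q in points if q != p)]

# Hypothetical design points: (LUT utilization in %, negated clock frequency in MHz).
designs = [(40.0, -200.0), (55.0, -250.0), (60.0, -240.0), (35.0, -150.0)]
front = nondominated(designs)
```

Here the point `(60.0, -240.0)` is dropped because `(55.0, -250.0)` uses fewer LUTs and runs faster. NSGA-II itself adds ranking, crowding distance, and genetic operators on top of this dominance test.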

“…For that, we select TVM due to its proven functionality, active community, and support for quantized computations [8,9]. We furthermore choose bit-serial forms of computation on ARM processors, which allows for arbitrary reduced-precision representations on various platforms [10]. The bit serial approach does not scale according to reduced data size and moreover, it does not seem to be bound by cache bandwidth, at least not in regard to the cache-bound model discussed in this work.…”

Section: Introduction (mentioning)

confidence: 99%

Understanding Cache Boundness of ML Operators on ARM Processors

Klein

Gratl

Mücke

et al. 2021

Preprint

Machine Learning (ML) compilers like TVM allow a fast and flexible deployment on embedded CPUs. This enables the use of non-standard operators, which are common in ML compression techniques. However, it is necessary to understand the limitations of typical compute-intense operators in ML workloads to design a proper solution. This is the first in-detail analysis of dense and convolution operators, generated with TVM, that compares to the fundamental hardware limits of embedded ARM processors. Thereby it explains the gap between computational peak performance, theoretical and measured, and real-world state-of-the-art results, created with TVM and OpenBLAS. Instead, one can see that single-precision general matrix multiply (GEMM) and convolutions are bound by L1-cache read bandwidth. Explorations of 8-bit and bit-serial quantized operators show that quantization can be used to achieve relevant speedups compared to cache-bound floating-point operators. However, the performance of quantized operators highly depends on the interaction between data layout and bit packing.
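The bit-serial operators and the bit packing mentioned in the abstract rest on a simple identity: for unsigned integers, a dot product equals the sum over all bit-position pairs (i, j) of 2^(i+j) times the population count of the AND of the corresponding bit planes. The sketch below illustrates that general idea only. It is not TVM's code generator or the cited paper's hardware design, the names `bit_plane` and `bit_serial_dot` are hypothetical, and operands are assumed to be unsigned and to fit in the stated bit widths.

```python
from typing import List

def bit_plane(values: List[int], bit: int) -> int:
    """Pack bit `bit` of every element into one integer (element k maps to bit k)."""
    packed = 0
    for k, v in enumerate(values):
        packed |= ((v >> bit) & 1) << k
    return packed

def bit_serial_dot(a: List[int], b: List[int], a_bits: int, b_bits: int) -> int:
    """Unsigned dot product computed from bit planes with AND, popcount, and shifts."""
    assert len(a) == len(b)
    total = 0
    for i in range(a_bits):
        plane_a = bit_plane(a, i)
        for j in range(b_bits):
            plane_b = bit_plane(b, j)
            # popcount(AND) counts positions where both bits are 1; weight is 2^(i+j).
            total += bin(plane_a & plane_b).count("1") << (i + j)
    return total

result = bit_serial_dot([3, 1, 2], [1, 2, 3], a_bits=2, b_bits=2)  # 3*1 + 1*2 + 2*3
```

The work grows with the product of the two bit widths, which is why reducing precision lowers the number of AND and popcount passes.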

“…Architectures that are more recent are reported in [30][31][32][33]. In [30] a generic system architecture is proposed for binary string comparisons that is based on a Virtex UltraScale+ FPGA.…”

mentioning

confidence: 99%

“…A LUT-efficient compressor architecture for performing population count operation is described in [31] to be used in matrix multiplications of variable precision. The authors started with a population count unit built as a tree of 6:3 LUTs and adders requiring a large number of LUTs and many stages to pipeline the adder tree.…”

mentioning

confidence: 99%

“…It is clear that none of the analyzed related work targets specifically CPSs incorporating the MicroBlaze processor, which can be essential for cost-sensitive applications. A LUT-efficient compressor architecture for performing population count operation is described in [31] to be used in matrix multiplications of variable precision. The authors started with a population count unit built as a tree of 6:3 LUTs and adders requiring a large number of LUTs and many stages to pipeline the adder tree.…”

mentioning

confidence: 99%

Accelerating Population Count with a Hardware Co-Processor for MicroBlaze

Skliarova

2021

JLPEA

This paper proposes a Field-Programmable Gate Array (FPGA)-based hardware accelerator for assisting the embedded MicroBlaze soft-core processor in calculating population count. The population count is frequently required to be executed in cyber-physical systems and can be applied to large data sets, such as in the case of molecular similarity search in cheminformatics, or assisting with computations performed by binarized neural networks. The MicroBlaze instruction set architecture (ISA) does not support this operation natively, so the count has to be realized as either a sequence of native instructions (in software) or in parallel in a dedicated hardware accelerator. Different hardware accelerator architectures are analyzed and compared to one another and to implementing the population count operation in MicroBlaze. The achieved experimental results with large vector lengths (up to 2^17) demonstrate that the best hardware accelerator with DMA (Direct Memory Access) is ~31 times faster than the best software version running on MicroBlaze. The proposed architectures are scalable and can easily be adjusted to both smaller and bigger input vector lengths. The entire system was implemented and tested on a Nexys-4 prototyping board containing a low-cost/low-power Artix-7 FPGA.
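As an illustration of the software route, where the count is realized as "a sequence of native instructions", the well-known SWAR (parallel bit-summing) popcount for a 32-bit word uses only shifts, masks, adds, and one multiply. This is a hedged sketch of that generic technique, not the software variants evaluated in the paper; the names `popcount32` and `popcount_vector` are hypothetical, and Python integers are masked to 32 bits to mimic a fixed-width register.

```python
from typing import Iterable

MASK32 = 0xFFFFFFFF

def popcount32(x: int) -> int:
    """Count set bits in a 32-bit word using only shifts, masks, adds, and a multiply."""
    x &= MASK32
    x = x - ((x >> 1) & 0x55555555)                  # sums of adjacent bit pairs
    x = (x & 0x33333333) + ((x >> 2) & 0x33333333)   # sums per 4-bit nibble
    x = (x + (x >> 4)) & 0x0F0F0F0F                  # sums per byte
    return ((x * 0x01010101) & MASK32) >> 24         # add the four bytes into the top byte

def popcount_vector(words: Iterable[int]) -> int:
    """Population count of a long bit vector stored as 32-bit words."""
    return sum(popcount32(w) for w in words)

ones = popcount_vector([0xFFFFFFFF, 0x0000000B, 0x00000000])
```

A hardware accelerator replaces this per-word instruction sequence with a parallel compressor or adder tree over many bits at once.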
