Compressing deep neural networks on FPGAs to binary and ternary precision with hls4ml

Ngadiuba, J.; Lončar, Vladimir; Pierini, M.; Summers, S.; Guglielmo, Giuseppe Di; Duarte, J.; Harris, P.; Rankin, D.; Jindariani, S.; Liu, Mia; Pedro, K.; Tran, N. V.; Kreinar, Edward; Sagear, Sheila; Wu, Z.; Hoang, Duc

doi:10.1088/2632-2153/aba042

Cited by 71 publications

(49 citation statements)

References 15 publications


“…This allows compression of the model size, but to some extent sacrifices accuracy. Recently, support for binary and ternary precision DNNs [43] trained quantization-aware has been included in the library. This greatly reduces the model size, but requiring such an extremely low precision of each parameter type sacrifices accuracy and generalization.…”

Section: Motivation (mentioning)

confidence: 99%

“…For example, when using a quantizer with a given alpha parameter (i.e., scaled weights), hls4ml inserts an operation to re-scale the layer output. For binary and ternary weights and activations, the same strategies as in [43] are used. With binary layers, the arithmetical value of -1 is encoded as 0, allowing the product to be expressed as an XNOR operation.…”

Section: Ultra Low-latency Quantized Model on FPGA Hardware (mentioning)

confidence: 99%
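
The XNOR encoding mentioned in the statement above can be illustrated with a small sketch. This is not hls4ml code: the function names `encode`, `xnor`, and `xnor_dot` are hypothetical and chosen only for this example. It assumes weights and activations take the values -1 and +1, encodes -1 as bit 0 and +1 as bit 1, and recovers the arithmetic dot product from the number of positions where the bits agree.

```python
def encode(values):
    """Map arithmetic values {-1, +1} to bits {0, 1} (-1 -> 0, +1 -> 1)."""
    return [1 if v > 0 else 0 for v in values]


def xnor(a, b):
    """XNOR of two single bits: 1 when the bits are equal, 0 otherwise."""
    return 1 - (a ^ b)


def xnor_dot(x_bits, w_bits):
    """Dot product of two +/-1 vectors, computed from their bit encodings.

    Each product x_i * w_i is +1 when the bits agree and -1 when they differ,
    so the sum equals (agreements) - (disagreements) = 2 * agreements - n.
    """
    agreements = sum(xnor(a, b) for a, b in zip(x_bits, w_bits))
    return 2 * agreements - len(x_bits)


# Illustrative inputs (made up for this example).
x = [1, -1, -1, 1, 1]
w = [1, 1, -1, -1, 1]

# The bitwise result matches the ordinary arithmetic dot product.
reference = sum(a * b for a, b in zip(x, w))
result = xnor_dot(encode(x), encode(w))
```

On hardware, the attraction of this encoding is that each multiplication reduces to a single-bit logic operation and the accumulation to a population count, which avoids multipliers entirely for binary layers.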


Automatic heterogeneous quantization of deep neural networks for low-latency inference on the edge for particle detectors

et al. 2021

Self Cite


Although the quest for more accurate solutions is pushing deep learning research towards larger and more complex algorithms, edge devices demand efficient inference and therefore reduction in model size, latency and energy consumption. One technique to limit model size is quantization, which implies using fewer bits to represent weights and biases. Such an approach usually results in a decline in performance. Here, we introduce a method for designing optimally heterogeneously quantized versions of deep neural network models for minimum-energy, high-accuracy, nanosecond inference and fully automated deployment on chip. With a per-layer, per-parameter type automatic quantization procedure, sampling from a wide range of quantizers, model energy consumption and size are minimized while high accuracy is maintained. This is crucial for the event selection procedure in proton-proton collisions at the CERN Large Hadron Collider, where resources are strictly limited and a latency of O(1) µs is required. Nanosecond inference and a resource consumption reduced by a factor of 50 when implemented on field-programmable gate array hardware are achieved.

FIG. I. An ultra-compressed deep neural network for particle identification on a Xilinx FPGA.


“…Development of ML models deployable to FPGA-based L1T systems is helped by tools for automatic network-to-circuit conversion such as hls4ml. Using hls4ml, several solutions for HEP-specific tasks (e.g., jet tagging) have been provided (Duarte et al, 2018; Coelho et al, 2020; Di Guglielmo et al, 2020; Summers et al, 2020), exploiting models with simpler architectures than what is shown here. This tool has been applied extensively for tasks in the HL-LHC upgrade of the CMS L1T system, including an autoencoder for anomaly detection, and DNNs for muon energy regression and identification, tau lepton identification, and vector boson fusion event classification (CMS Collaboration, 2020).…”

Section: Related Work (mentioning)

confidence: 99%

“…On the other hand, the quantized model uses more LUTs, mainly for the multiplications in the GARNET encoders and decoders, as discussed in Section 4. However, it is known that the expected LUT usage tends to be overestimated in Vivado HLS, while the expected DSP usage tends to be accurate (Duarte et al, 2018; Di Guglielmo et al, 2020). The DSP usage of 3.1 × 10^3 for the continuous model is well within the limit of the target device, but is more than what is available on a single die slice (2.8 × 10^3) (Xilinx, 2020).…”

Section: Model Synthesis and Performance (mentioning)

confidence: 99%

“…Extensions to convolutional and recurrent neural networks are in development. The library comes with handles to compress the model by quantization, up to binary and ternary precision (Di Guglielmo et al, 2020). Recently, support for QKERAS (Qkeras, 2020) models has been added, in order to allow for quantization-aware training of models (Coelho et al, 2020).…”

Section: Introduction (mentioning)

confidence: 99%


Distance-Weighted Graph Neural Networks on FPGAs for Real-Time Particle Reconstruction in High Energy Physics

et al. 2021

Self Cite


Graph neural networks have been shown to achieve excellent performance for several crucial tasks in particle physics, such as charged particle tracking, jet tagging, and clustering. An important domain for the application of these networks is the FPGA-based first layer of real-time data filtering at the CERN Large Hadron Collider, which has strict latency and resource constraints. We discuss how to design distance-weighted graph networks that can be executed with a latency of less than one μs on an FPGA. To do so, we consider a representative task associated with particle reconstruction and identification in a next-generation calorimeter operating at a particle collider. We use a graph network architecture developed for such purposes, and apply additional simplifications to match the computing constraints of Level-1 trigger systems, including weight quantization. Using the hls4ml library, we convert the compressed models into firmware to be implemented on an FPGA. Performance of the synthesized models is presented both in terms of inference accuracy and resource usage.


A unified test data volume compression scheme for circular scan architecture using hosted cuckoo optimization

Shukla, Mayet, Raja et al. 2023

J Supercomput

