Invited: Bambu: an Open-Source Research Framework for the High-Level Synthesis of Complex Applications

Ferrandi, Fabrizio; Castellana, Vito Giovanni; Curzel, Serena; Fezzardi, Pietro; Fiorito, Michele; Lattuada, Marco; Minutoli, Marco; Pilato, Christian; Tumeo, Antonino

doi:10.1109/dac18074.2021.9586110

Cited by 51 publications

(36 citation statements)

References 12 publications

(11 reference statements)


“…As both cfdlang and optionally also teil leave a gap in implementing a scalar type, base2 declares parametric types that model arbitrary-precision data types and abstract operations on them. Most prominently, we can use the contained ieee754 type to encode custom floating-point types that other HLS tools, like Bambu [11], can consume.…”

Section: 32 (mentioning)

confidence: 99%


Automatic Creation of High-Bandwidth Memory Architectures from Domain-Specific Languages: The Case of Computational Fluid Dynamics

Soldavini, Friebel, Tibaldi et al. 2022

Preprint

Self Cite

Numerical simulations are increasingly used for solving complex problems. Most of these algorithms are massively parallel and can benefit from the spatial parallelism offered by reconfigurable logic. Modern FPGA devices can benefit from high-bandwidth memory technologies, but most of these applications are memory-bound and require designers to craft advanced communication and memory architectures for efficient data movement and on-chip storage. This development process requires hardware design skills that are uncommon in domain-specific experts. In this paper, we propose an automated tool flow from a domain-specific language (DSL) to generate massively-parallel accelerators on FPGA to address these challenges. We use the case of computational fluid dynamics (CFD) as a paradigmatic example. Our flow starts from the high-level specification of tensor operations and combines an MLIR-based compiler with an in-house hardware generation flow to automatically design systems. These systems integrate several parallel accelerators that operate on independent data and a specialized memory architecture that moves data efficiently, aiming at fully exploiting the available CPU-FPGA bandwidth for data transfers. We simulated applications with millions of elements, achieving up to 100 GFLOPS with one compute unit and custom precision when targeting a Xilinx Alveo U280. Our FPGA implementation is almost 25× more energy efficient than Intel implementations. We also discuss how to address practical limitations when scaling up the parallelism with multiple computing units on the same FPGA board.



“…Specifically, we used the ap_fixed library to specify these formats so that they can be automatically synthesized by Vitis HLS. So, these optimizations are compatible with any HLS tool (like Bambu [11]) that can synthesize these formats. This step brings with it considerable advantages.…”

Section: Resource Optimizations and Multiple Compute Units (mentioning)

confidence: 99%

Automatic Creation of High-Bandwidth Memory Architectures from Domain-Specific Languages: The Case of Computational Fluid Dynamics

Soldavini, Friebel, Tibaldi et al. 2022

Preprint

Self Cite


“…Other parameterizable hardware description languages such as Chisel [3] and PyMTL [20] can be used as hardware generators that produce synthesizable Verilog, and the output Verilog can then be used as inputs to SNS. HLS designs can also be indirectly supported by using tools such as Bambu [8] to generate the synthesizable HDL.…”

Section: Usage Model (mentioning)

confidence: 99%

SNS's not a synthesizer

Kjellqvist, Wills 2022

Proceedings of the 49th Annual International Symposium on Computer Architecture


The number of transistors that can fit on one monolithic chip has reached billions to tens of billions in this decade thanks to Moore's Law. With the advancement of every technology generation, the transistor counts per chip grow at a pace that brings about exponential increase in design time, including the synthesis process used to perform design space explorations. Such a long delay in obtaining synthesis results hinders an efficient chip development process, significantly impacting time-to-market. In addition, these large-scale integrated circuits tend to have larger and higher-dimension design spaces to explore, making it prohibitively expensive to obtain physical characteristics of all possible designs using traditional synthesis tools. In this work, we propose a deep-learning-based synthesis predictor called SNS (SNS's not a Synthesizer), that predicts the area, power, and timing physical characteristics of a broad range of designs two to three orders of magnitude faster than the Synopsys Design Compiler while providing on average a 0.4998 RRSE (root relative square error). We further evaluate SNS via two representative case studies, a general-purpose out-of-order CPU case study using RISC-V Boom open-source design and an accelerator case study using an in-house Chisel implementation of DianNao, to demonstrate the capabilities and validity of SNS.

CCS Concepts: • Hardware → Integrated circuits; High-level and register-transfer level synthesis; • Computing methodologies → Neural networks.


“…The frontend then generates an LLVM IR as output, which is the starting point for hardware generation. The SODA backend integrates Bambu [37], a state-of-the-art open-source HLS tool, to generate the hardware accelerators. To compile code that will be executed on a host processor, instead, SODA uses standard LLVM tools.…”

Section: The Soda Synthesizer (mentioning)

confidence: 99%

End-to-End Synthesis of Dynamically Controlled Machine Learning Accelerators

Curzel, Agostini, Castellana et al. 2022

IEEE Trans. Comput.

Self Cite


Edge systems are required to autonomously make real-time decisions based on large quantities of input data under strict power, performance, area, and other constraints. Meeting these constraints is only possible by specializing systems through hardware accelerators purposefully built for machine learning and data analysis algorithms. However, data science evolves at a quick pace, and manual design of custom accelerators has high non-recurrent engineering costs: general solutions are needed to automatically and rapidly transition from the formulation of a new algorithm to the deployment of a dedicated hardware implementation. Our solution is the SOftware Defined Architectures (SODA) Synthesizer, an end-to-end, multi-level, modular, extensible compiler toolchain providing a direct path from machine learning tools to hardware. The SODA Synthesizer frontend is based on the multilevel intermediate representation (MLIR) framework; it ingests pre-trained machine learning models, identifies kernels suited for acceleration, performs high-level optimizations, and prepares them for hardware synthesis. In the backend, SODA leverages state-of-the-art high-level synthesis techniques to generate highly efficient accelerators, targeting both field programmable devices (FPGAs) and application-specific circuits (ASICs). In this paper, we describe how the SODA Synthesizer can also assemble the generated accelerators (based on the finite state machine with datapath model) in a custom system driven by a distributed controller, building a coarse-grained dataflow architecture that does not require a host processor to orchestrate parallel execution of multiple accelerators. We show the effectiveness of our approach by automatically generating ASIC accelerators for layers of popular deep neural networks (DNNs). Our high-level optimizations result in up to 74x speedup on isolated accelerators for individual DNN layers, and our dynamically scheduled architecture yields an additional 3x performance improvement when combining accelerators to handle streaming inputs.
