Buffer Placement and Sizing for High-Performance Dataflow Circuits

Josipović, Lana; Sheikhha, Shabnam; Guerrieri, Andrea; Ienne, Paolo; Cortadella, Jordi

doi:10.1145/3373087.3375314

Cited by 31 publications

(13 citation statements)

References 31 publications

“…We developed an optimization approach [37] which allows for resource-optimal buffer placement and sizing, with the purpose of maximizing throughput of the performance-critical loops at the desired clock frequency. Our optimization strategy consists out of two main steps, as illustrated in Algorithm 2:…”

Section: Buffers and Performance (mentioning)

confidence: 99%

From C/C++ Code to High-Performance Dataflow Circuits

Josipović

Guerrieri

Ienne

2022

IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst.

Self Cite

High-level synthesis (HLS) tools typically generate statically scheduled datapaths. Static scheduling implies that the resulting circuits have a hard time exploiting parallelism in code with potential memory dependences, with control dependences, or where performance is limited by long latency control decisions. In this work, we describe an HLS approach which generates dynamically scheduled, dataflow circuits out of imperative code. We detail a complete set of rules to transform a standard compiler intermediate representation into a high-performance dataflow circuit that is able to dynamically resolve memory dependences and adapt its behavior on the fly to particular control flow decisions and operation latencies. Compared to a traditional HLS tool, the result is a different trade-off between performance and circuit complexity: statically scheduled circuits display the best performance per cost in regular applications, but general-purpose, irregular, and control-dominated computing tasks require the runtime flexibility of dynamic scheduling. Therefore, enabling dynamic behavior in HLS is key to dealing with the increasing computational demands of new contexts and broader application domains.

“…The spatial CGRA mapping is challenging because we should balance all pipeline paths by inserting queues after mapping. The number of inserted queues can significantly impact architecture cost and throughput [24]. In this work, we also evaluate an asynchronous CGRA model, where unbalanced paths do not require registers, although throughput degradation may occur, as is showed in subsection II-D.…”

Section: Reshape - Architecture-independent Mapping (mentioning)

confidence: 99%

“…One approach to avoid FIFOs is asynchronous data-flow mapping. A PE can process an operation if and only if all input data are available at the correct time frame [24], [26]. The throughput can be smaller than one.…”

Section: Asynchronous Data-flow (mentioning)

confidence: 99%

RESHAPE: A Run-Time Dataflow Hardware-Based Mapping for CGRA Overlays

Vieira

Canesche

Bragança

et al. 2021

2021 IEEE International Symposium on Circuits and Systems (ISCAS)

Coarse-grained reconfigurable architectures (CGRA) are a power-efficient approach for hardware accelerators. However, there are few EDA tools for CGRA. We develop hardware-based placement and routing (P&R) for fully-pipelined CGRA mapped as an FPGA overlay. The key idea is to use the available FPGA resources to replicate several mapping units, thus exploring parallel execution, area/execution time trade-offs, and achieving near-optimal mapping solutions. Furthermore, our P&R provides portability and an incremental run-time approach. In comparison to VPR and CGRA-ME tools and a time-multiplexer approach, our spatial mapping reduces the P&R execution time, and it improves the performance up to hundreds of Gops/s by using fully-pipelined architectures.
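
The RESHAPE citation statements above refer to balancing pipeline paths by inserting queues after mapping. As a rough illustration of that idea only (not the RESHAPE algorithm and not the buffer-placement method of the cited paper), the sketch below assigns each node of an acyclic dataflow graph its longest-path level and pads every edge with the difference in levels. The function name `balance_paths`, the unit latency per node, and the example graph are assumptions made for this sketch; longest-path levelling gives a valid balancing but not necessarily the one with the fewest queues.

```python
from collections import defaultdict, deque


def balance_paths(edges):
    """Return {edge: extra queue stages} so that all paths into a node have equal length.

    Assumptions (hypothetical, for illustration): the graph is acyclic and
    every node has a latency of one cycle.
    """
    succs = defaultdict(list)
    indeg = defaultdict(int)
    nodes = set()
    for u, v in edges:
        succs[u].append(v)
        indeg[v] += 1
        nodes.update((u, v))

    # Longest-path level of each node, computed in topological order.
    level = {n: 0 for n in nodes}
    ready = deque(sorted(n for n in nodes if indeg[n] == 0))
    while ready:
        u = ready.popleft()
        for v in succs[u]:
            level[v] = max(level[v], level[u] + 1)
            indeg[v] -= 1
            if indeg[v] == 0:
                ready.append(v)

    # An edge spanning more than one level needs that many extra stages.
    return {(u, v): level[v] - level[u] - 1 for u, v in edges}


# Example: a three-stage path a -> b -> c -> d and a shortcut a -> d.
edges = [("a", "b"), ("b", "c"), ("c", "d"), ("a", "d")]
queues = balance_paths(edges)
```

In this example only the shortcut edge is padded, which is the cost/throughput trade-off the statement alludes to: every padded edge costs registers, while leaving it unpadded (the asynchronous model) can lower throughput.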

“…Dataflow circuits are fundamentally different: their schedules are not predetermined at compile time but devised as the circuit runs. Moreover, Lana [19,20] investigates how to create timing-efficient, high-throughput pipelines, and their MILP model is based on the theory of marked graphs and allows for resource-optimal buffer placement and sizing, with the purpose of maximizing throughput at the desired clock frequency. However, they are purely theoretical optimizations of the computational model without abstracting a generalized computational template for the computational model, which still requires a complete understanding of the circuit structure and does not improve the user's coding efficiency.…”

Section: Relate Work (mentioning)

confidence: 99%

A Highly Configurable High-Level Synthesis Functional Pattern Library

Huang

Gao

et al. 2021

Electronics

FPGA has recently played an increasingly important role in heterogeneous computing, but Register Transfer Level design flows are not only inefficient in design, but also require designers to be familiar with the circuit architecture. High-level synthesis (HLS) allows developers to design FPGA circuits more efficiently with a more familiar programming language, a higher level of abstraction, and automatic adaptation of timing constraints. When using HLS tools, such as Xilinx Vivado HLS, specific design patterns and techniques are required in order to create high-performance circuits. Moreover, designing efficient concurrency and data flow structures requires a deep understanding of the hardware, imposing more learning costs on programmers. In this paper, we propose a set of functional patterns libraries based on the MapReduce model, implemented by C++ templates, which can quickly implement high-performance parallel pipelined computing models on FPGA with specified simple parameters. The usage of this pattern library allows flexible adaptation of parallel and flow structures in algorithms, which greatly improves the coding efficiency. The contributions of this paper are as follows. (1) Four standard functional operators suitable for hardware parallel computing are defined. (2) Functional concurrent programming patterns are described based on C++ templates and Xilinx HLS. (3) The efficiency of this programming paradigm is verified with two algorithms with different complexity.
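
The citation statement above notes that the cited MILP model builds on the theory of marked graphs. A standard result from that theory is that the steady-state throughput of a marked graph is bounded by the smallest ratio of tokens to latency over its cycles. The sketch below evaluates that bound by enumerating simple cycles of a small graph; it is not the authors' MILP formulation, and the function names, the edge format `(src, dst, latency, tokens)`, and the example numbers are assumptions made for illustration. Every cycle is assumed to have positive total latency.

```python
from fractions import Fraction


def cycle_ratios(edges):
    """Yield tokens/latency for every simple cycle.

    edges: list of (src, dst, latency, tokens) tuples (hypothetical format).
    """
    adj = {}
    for u, v, lat, tok in edges:
        adj.setdefault(u, []).append((v, lat, tok))
    nodes = sorted({n for u, v, _, _ in edges for n in (u, v)})

    def dfs(start, node, lat, tok, visited):
        for nxt, l, t in adj.get(node, []):
            if nxt == start:
                # Closed a cycle back to the starting node.
                yield Fraction(tok + t, lat + l)
            elif nxt > start and nxt not in visited:
                # Only visit nodes larger than start so each cycle is reported once.
                yield from dfs(start, nxt, lat + l, tok + t, visited | {nxt})

    for s in nodes:
        yield from dfs(s, s, 0, 0, {s})


def throughput_bound(edges):
    """Minimum tokens/latency ratio over all simple cycles."""
    return min(cycle_ratios(edges))


# Example: a long loop a -> b -> c -> a and a short loop a -> b -> a,
# each carrying a single token.
edges = [
    ("a", "b", 1, 0),
    ("b", "c", 2, 0),
    ("c", "a", 1, 1),
    ("b", "a", 1, 1),
]
bound = throughput_bound(edges)
```

Here the long loop, with one token over four cycles of latency, is the critical one. Exhaustive cycle enumeration is only practical for small graphs, which is one reason an optimization-based formulation is attractive for real circuits.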
