HeteroFlow

Xiang, Shaojie; Lai, Yi-Hsiang; Zhou, Yuan; Chen, Hongzheng; Zhang, Niansong; Pal, Debjit; Zhang, Zhiru

doi:10.1145/3490422.3502369

Cited by 15 publications

(5 citation statements)

References 25 publications


“…Those kernels are functional but not optimized (e.g. buffer sizing, burst transfer), and could be improved as future work on heterogeneous optimization [35,40].…”

Section: Glue Code Generation (mentioning)

confidence: 99%

“…On the contrary, PREESM considers kernels as black boxes without assumptions on their inner C++ code. HeteroFlow [40] too takes Halide as an input, via the HeteroHalide [27] compiler. While HeteroFlow supports data transfer directives written by the designer, as well as buffer reuse, it does not embed a delay type system to perform buffer sizing.…”

Section: HLS Tools With Buffer Optimization (mentioning)

confidence: 99%


Automated Buffer Sizing of Dataflow Applications in a High-level Synthesis Workflow

Honorat, Dardaillon, Miomandre et al. 2024

ACM Trans. Reconfigurable Technol. Syst.


High-Level Synthesis (HLS) tools are mature enough to provide efficient code generation for computation kernels on FPGA hardware. For more complex applications, multiple kernels may be connected by a dataflow graph. Although some tools, such as Xilinx Vitis HLS, support dataflow directives, they lack efficient analysis methods to compute the buffer sizes between kernels in a dataflow graph. This paper proposes an original method to safely approximate such buffer sizes. The first contribution computes an initial overestimation of buffer sizes, without knowing the memory access patterns of kernels. The second contribution iteratively refines those buffer sizes thanks to cosimulation. Moreover, the paper introduces an open source framework using these methods to facilitate dataflow programming on FPGA using HLS. The proposed methods and framework have been tested on 7 dataflow applications, and outperform Vitis HLS cosimulation in 5 benchmarks, either in terms of BRAM and LUT usage, or in terms of exploration time. In the 2 other benchmarks, our best method gets results similar to Vitis HLS. Last but not least, our method admits directed cycles in the application graphs.
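The two-step idea in the abstract above (start from a safe overestimate, then refine it against a pass/fail check) can be illustrated with a toy sketch. This is not the paper's algorithm: the function `refine_buffer_sizes`, the callback `passes` (standing in for a cosimulation run), and the buffer names are all hypothetical, and the sketch assumes the check is monotonic, meaning that enlarging a buffer never turns a passing configuration into a failing one.

```python
from typing import Callable, Dict

Sizes = Dict[str, int]


def refine_buffer_sizes(overestimate: Sizes,
                        passes: Callable[[Sizes], bool]) -> Sizes:
    """Shrink each buffer by bisection while `passes` keeps returning True.

    `passes` is a hypothetical stand-in for a cosimulation run; it is
    assumed to be monotonic in every buffer size.
    """
    if not passes(overestimate):
        raise ValueError("the initial overestimate must pass the check")
    sizes = dict(overestimate)
    for name in list(sizes):
        lo, hi = 1, sizes[name]          # invariant: size `hi` passes
        while lo < hi:
            mid = (lo + hi) // 2
            trial = {**sizes, name: mid}
            if passes(trial):
                hi = mid                 # mid is enough, try smaller
            else:
                lo = mid + 1             # mid is too small
        sizes[name] = hi
    return sizes


def toy_check(sizes: Sizes) -> bool:
    """Toy replacement for cosimulation: each FIFO needs a minimum depth."""
    required = {"fifo_a": 5, "fifo_b": 17}   # made-up values
    return all(sizes[k] >= required[k] for k in required)


refined = refine_buffer_sizes({"fifo_a": 64, "fifo_b": 128}, toy_check)
print(refined)  # {'fifo_a': 5, 'fifo_b': 17}
```

Bisection needs only a logarithmic number of checks per buffer, which matters when each check is as expensive as a cosimulation run. A real dataflow graph may have interacting buffers, so shrinking them one at a time as done here is a simplification.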



“…Since (10) is nonlinear in $T_{k,k-1}$, it needs to be solved using an iterative Gauss-Newton method. Given an estimation of the relative transformation $T_{k,k-1}$, an incremental update $T(\xi)$ of the estimate can be parameterized with a twist coordinate $\xi \in \mathfrak{se}(3)$.…”

Section: Sparse Image Alignment (mentioning)

confidence: 99%
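The Gauss-Newton scheme with an incremental update, as mentioned in the statement above, can be illustrated on a much smaller problem. The sketch below is an assumption-laden toy, not the citing paper's sparse image alignment: it estimates a single planar rotation angle (the group SO(2) instead of SE(3)) that aligns two 2-D point sets, and the names `rotate` and `gauss_newton_angle` are hypothetical.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]


def rotate(theta: float, p: Point) -> Point:
    """Apply the planar rotation R(theta) to point p."""
    c, s = math.cos(theta), math.sin(theta)
    return (c * p[0] - s * p[1], s * p[0] + c * p[1])


def gauss_newton_angle(src: List[Point], dst: List[Point],
                       theta: float = 0.0, iters: int = 20,
                       tol: float = 1e-12) -> float:
    """Minimize sum ||R(theta) p_i - q_i||^2 over theta by Gauss-Newton.

    Assumes at least one source point is away from the origin, otherwise
    the normal equation below is degenerate.
    """
    for _ in range(iters):
        jtj = 0.0   # J^T J (a scalar, since there is one parameter)
        jtr = 0.0   # J^T r
        for p, q in zip(src, dst):
            rp = rotate(theta, p)
            r = (rp[0] - q[0], rp[1] - q[1])   # residual
            j = (-rp[1], rp[0])                # d(R(theta) p)/d(theta)
            jtj += j[0] * j[0] + j[1] * j[1]
            jtr += j[0] * r[0] + j[1] * r[1]
        delta = -jtr / jtj                     # solve the normal equation
        theta += delta                         # incremental update
        if abs(delta) < tol:
            break
    return theta


# Noise-free toy data generated with a known angle of 0.7 rad.
source = [(1.0, 0.0), (0.0, 2.0), (-1.5, 0.5)]
target = [rotate(0.7, p) for p in source]
estimate = gauss_newton_angle(source, target)
print(abs(estimate - 0.7) < 1e-9)  # True
```

In SO(2), composing the current rotation with a small rotation is the same as adding the angles, so the update is a plain sum. In SE(3) the increment would instead be a twist mapped through the exponential map and composed with the current transformation.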

“…The IRC and FA hardware accelerators as well as the host code are developed using Xilinx SDSoC 2019.1 and HLS C/C++. We also utilize the state-of-the-art HeteroFlow [10] to develop IRC and FA designs. With HeteroFlow, we generate only the HLS C/C++ code for the accelerators, as it does not support SDSoC nor the Xilinx ZU9EG FPGA.…”

Section: A Experiments Setup (mentioning)

confidence: 99%

“…We first perform an end-to-end profiling analysis to identify the bottleneck of SVO. After that, we develop the FPGA accelerator to reduce the computational overhead by using state-of-the-art software-hardware co-design tools and HLS tools such as Xilinx Software-Defined System-on-Chip (SD-SoC) and HeteroFlow [10]. Then, we apply the proposed methods to the FPGA accelerator to reduce the data transfer overhead between CPU and FPGA accelerators.…”

mentioning

confidence: 99%


Exploring Sparse Visual Odometry Acceleration With High-Level Synthesis

Iordanou, Riley et al. 2023

IEEE Access


Visual Odometry (VO) systems are widely used to determine the position and orientation of a robot or camera in an unknown environment. They are deployed on resource-constrained platforms, such as drones and Virtual Reality (VR) or Augmented Reality (AR) headsets. VO systems harnessing modern Systems-on-Chip (SoCs) with an integrated Field Programmable Gate Array (FPGA) have the potential to improve overall system performance. This paper explores the FPGA acceleration of sparse VO kernels using High-Level Synthesis (HLS), as this kind of VO system has been designed for use with low-power SoCs. We show that both computational and data transfer overheads between the processing cores of the CPU of the SoC and the accelerators on the FPGA need to be optimized to obtain better end-to-end performance. This is a result of the additional data movement incurred when using an FPGA accelerator and also because of the sparse computational nature with predictable or random memory access patterns of the kernels involved. However, state-of-the-art HLS tools are not yet able to perform the required optimizations automatically because they usually assume that the kernels to be accelerated have dense computational patterns with regular memory access. In this paper we propose three potentially generic methods to reduce the data transfer between the CPU and the customised hardware kernels on the FPGA; these methods are: (a) approximation based on domain-specific knowledge, (b) image compression, and (c) the use of on-the-fly computation. We present a case study of the use of these methods on SVO, a state-of-the-art sparse VO system with a semi-direct front-end. We demonstrate that our proposed methods can reduce data transfer overhead to achieve better end-to-end performance and that they can be applied not only when using standard Xilinx HLS tools but also with other state-of-the-art HLS tools, such as HeteroFlow. Compared to the baseline performance of the original SVO software on an Arm CPU, our proposed methods assist the HLS and HeteroFlow designs to achieve a speedup of 2.4x and 2.14x, respectively, without noticeable accuracy loss. The HLS and HeteroFlow designs also achieve improvements of 1.85x and 1.89x, respectively, in energy efficiency on the SoC system used. Compared to the SVO software baseline running on the Intel Xeon CPU, our proposed methods assist the HLS and HeteroFlow designs to achieve 8.2x and 8.3x improvement in energy efficiency, respectively.


AXI4MLIR: User-Driven Automatic Host Code Generation for Custom AXI-Based Accelerators

Agostini, Haris, Gibson et al. 2024

2024 IEEE/ACM International Symposium on Code Generation and Optimization (CGO)


and then for the data associated with the tile of argument 0. Furthermore, send_dim and send_idx can be used to send tile dimensions or tile indices, which could be used to drive more complex accelerators. Subsequent text will refer to an opcode entry, such as "sA", simply as opcode.

- opcode_flow: represents valid opcode/data transfer flows and respects the syntax scheme shown in Figure 8. Figure 6a-L23 shows an example, which defines an input A stationary (associated with argument 0) valid flow implemented with two opcodes, using the identifiers defined in the opcode_map. Additional valid examples for output C stationary and nothing stationary flows are shown in lines 24 and 25 of Figure 6a. The information in opcode_flow is parsed and the set of parentheses is understood as a proxy to specify multiple scopes for sequential or nested for loops in the algorithm. Following this flow, logic related to "sA" would be transmitted inside of the second loop (Figure 6b-L8 to L10), and logic related to "sBcCrC" would appear in the innermost loop (Figure 6b-L12 to L18). Suppose the user decides to forego the opportunity to specify input A as stationary, then the opcode flow could become "(sA sB cC rC)", and all communication driver logic would be generated in the innermost loop.

The accel dialect: Before generating function calls for runtime replacement to the DMA runtime library (described in Section III-A), we perform host code transformations 5 (Figure 4) by lowering the linalg.generic operation, with the proposed trait, to standard MLIR dialects (scf, arith, memref) and a new dialect that we call accel. Operations in the accel dialect abstract host-accelerator transactions, such as initialization, memory transfers, and synchronization. Figure 9 presents the core accel operations and their semantics, providing examples of how these operations map onto our custom AXI DMA library calls. Additionally, Figure 6b shows how the accel operations are used in our MatMul example. Note that it is easier to perform analysis and transformations [...]

HeteroFlow [44], an FPGA accelerator programming model, decouples algorithm specification from data placement optimization using a new primitive ".to()". This approach exposes data placement specification at various granularities, achieving efficient code generation while matching optimized manual HLS designs. HeteroFlow does not support arbitrary custom accelerators, as it is limited to accelerators co-designed with its framework (extended HeteroCL [45]). It also requires the new primitive to be used while describing the algorithm in Python, imposing manual application modification. Unlike HeteroFlow, AXI4MLIR utilizes MLIR to target languages employing linalg.generic operations during compilation, elimin...
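The excerpt above says that the parentheses in opcode_flow act as a proxy for loop scopes. A minimal sketch of that reading follows, assuming whitespace-separated opcode identifiers and the string "(sA (sBcCrC))" for the input A stationary case; the exact syntax is defined in the paper's Figure 8, which is not reproduced here, and the function `opcode_scopes` is hypothetical rather than part of AXI4MLIR.

```python
from typing import List, Tuple


def opcode_scopes(flow: str) -> List[Tuple[str, int]]:
    """Return (opcode, nesting depth) pairs for an opcode_flow-like string.

    Depth 1 is the outermost pair of parentheses. How a depth maps to a
    concrete loop level is left to the code generator and not modeled here.
    """
    depth = 0
    result: List[Tuple[str, int]] = []
    # Pad parentheses with spaces so that split() yields them as tokens.
    tokens = flow.replace("(", " ( ").replace(")", " ) ").split()
    for tok in tokens:
        if tok == "(":
            depth += 1
        elif tok == ")":
            depth -= 1
            if depth < 0:
                raise ValueError("unbalanced ')' in flow")
        else:
            if depth == 0:
                raise ValueError(f"opcode {tok!r} outside any scope")
            result.append((tok, depth))
    if depth != 0:
        raise ValueError("unbalanced '(' in flow")
    return result


print(opcode_scopes("(sA (sBcCrC))"))
# [('sA', 1), ('sBcCrC', 2)]
print(opcode_scopes("(sA sB cC rC)"))
# [('sA', 1), ('sB', 1), ('cC', 1), ('rC', 1)]
```

In the first flow the opcode "sA" sits one scope above "sBcCrC", which matches the description of A being sent less often than the remaining transfers. In the second flow all four opcodes share one scope, so all driver logic lands in the same loop.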

