Extending High-Level Synthesis for Task-Parallel Programs

Chi, Yuze; Guo, Licheng; Lau, Jason; Choi, Y.; Wang, Jie; Cong, Jason

doi:10.1109/fccm51124.2021.00032

Cited by 19 publications

(3 citation statements)

References 63 publications


“…The 'max depth' of the arrays listed in Table 1 was assigned the maximum value required among the benchmarks: V = 2,400, C = 630K, UCB size = 32K, O = 120K, and K = 64. FYalSAT has been programmed in C++, and it was synthesized with TAPA [33] and AMD/Xilinx's Vitis HLS 2022.2 [19]. The generated FPGA bitstream was tested on the Alveo U250 platform [34].…”

Section: Evaluation a Experimental Setup (mentioning)

confidence: 99%

FYalSAT: High-Throughput Stochastic Local Search K-SAT Solver on FPGA

Choi,

Kim

2024

IEEE Access

Self Cite


The satisfiability (SAT) problem is a fundamental challenge in computing and has a broad range of applications. The problem is NP-complete, and many algorithmic and architectural improvements have aimed at accelerating SAT solvers. However, most existing stochastic local search (SLS) hardware solvers still rely on the outdated WalkSAT algorithm, and their performance degrades on problems with a large number of literals per clause. In this paper, we present FYalSAT, a field-programmable gate array (FPGA) based SLS SAT solver designed for high throughput. We incorporate a conflict-free data rearrangement scheme and a novel synchronization method to increase parallelism. We also apply optimizations such as clause prefetching, module overlapping, and pipelining to improve performance. Experimental results demonstrate that FYalSAT exceeds the throughput of existing SLS FPGA solvers by 9.07× to 110× for benchmarks with a large number of literals per clause.

Index Terms: field-programmable gate arrays, satisfiability problem, stochastic local search, accelerator architecture


“…A holistic Task Scheduling solution is presented in [42], where a HW task scheduler with the ability to drive CPUs, GPUs, and FPGAs is described. Other approaches use Task Scheduling program representations to automatically synthesize equivalent hardware [43] or configure dataflow systems [44]. These solutions offer substantial energy and latency advantages over ordinary CPU or GPU execution, but lack the versatility that these baselines or our proposal offer.…”

Section: Related Work (mentioning)

confidence: 99%

Enabling HW-Based Task Scheduling in Large Multicore Architectures

Morais,

Álvarez,

Jiménez-González

et al. 2024

IEEE Trans. Comput.


Dynamic Task Scheduling is an enticing programming model that aims to ease the development of parallel programs with intrinsically irregular or data-dependent parallelism. The performance of such solutions relies on the ability of the Task Scheduling HW/SW stack to efficiently evaluate dependencies at runtime and schedule work to available cores. Traditional SW-only systems incur scheduling overheads of around 30K processor cycles per task, which severely limit the (core count, task granularity) combinations that they can adequately handle. Previous work on HW-accelerated Task Scheduling has shown that such systems can support high-performance scheduling on processors with up to eight cores, but questions remained regarding the viability of such solutions for the greater number of cores now frequently found in high-end SMP systems. This work presents an FPGA-proven, tightly integrated, Linux-capable, 30-core RISC-V system with hardware-accelerated Task Scheduling. We use this implementation to show that HW Task Scheduling can still offer competitive performance at such a high core count, and describe how this organization includes hardware and software optimizations that make it even more scalable than previous solutions. Finally, we outline ways in which this architecture could be augmented to overcome inter-core communication bottlenecks, mitigating the cache-degradation effects usually involved in the parallelization of highly optimized serial code.


“…Our Dataflow cache is actually an example of cyclic dataflow graph. Fine Licht et al [16] and Chi et al [17] The tasks communicate and synchronize through FIFO queues. The request FIFO, which flows from Master to Slave, contains the inputs to the Slave operation (e.g., if the operation is a read access from an off-chip memory, it contains the address to be read).…”

Section: A Cyclic Dataflow Protocol (mentioning)

confidence: 99%

Array-Specific Dataflow Caches for High-Level Synthesis of Memory-Intensive Algorithms on FPGAs

et al. 2022


Designs implemented on field-programmable gate arrays (FPGAs) via high-level synthesis (HLS) suffer from off-chip memory latency and bandwidth bottlenecks. FPGAs can access both large but slow off-chip memories (DRAM) and fast but small on-chip memories (block RAMs and registers). HLS tools allow exploiting the memory hierarchy in a scratchpad-like fashion, requiring significant manual effort. We propose an automation of FPGA memory management in Xilinx Vitis HLS through a fully configurable C++ source-level cache. Each DRAM-mapped array can be associated with a private level 2 (L2) cache with one or more ports, and each port can optionally provide a level 1 cache. The L2 cache runs in a separate dataflow task with respect to the application accessing it. This solution isolates off-chip memory accesses and data buffering into dedicated dataflow tasks, resembling the load, compute, store design paradigm, but without the drawback of manual algorithm refactoring. Experimental results collected from an FPGA board show that our cache speeds up the execution of a variety of benchmarks by up to 60 times compared to the out-of-the-box solution provided by HLS, while requiring very limited optimization effort. Our caches are not meant to compete with the quality of results (QoR) of manually optimized implementations, but rather to save significant design effort in exchange for some QoR. This makes the HLS flow more software-like and allows the designer to focus on algorithmic optimizations rather than on explicit memory management. Moreover, caching could be the only feasible memory optimization for algorithms with data-dependent or irregular memory access patterns but good data locality.
