Loop shifting and compaction for the high-level synthesis of designs with complex control flow

Gupta, Sumit Kumar; Dutt, Nikil; Gupta, Rajesh K.; Nicolau, Alexandru

doi:10.1109/date.2004.1268836

Cited by 24 publications

(15 citation statements)

References 21 publications

“…Directly connected to our work are those of Guo et al [2005], Weinhardt and Luk [2001] and Gupta et al [2004], where hardware is generated after optimizing the kernel loops.…”

Section: Background and Related Work (mentioning)

confidence: 99%

Optimal Loop Unrolling and Shifting for Reconfigurable Architectures

Dragomir; Stefanov; Bertels

2009

ACM Trans. Reconfigurable Technol. Syst.

In this article, we present a new technique for optimizing loops that contain kernels mapped on a reconfigurable fabric. We assume the Molen machine organization as our framework. We propose combining loop unrolling with loop shifting, which is used to relocate the function calls contained in the loop body such that in every iteration of the transformed loop, software functions (running on GPP) execute in parallel with multiple instances of the kernel (running on FPGA). The algorithm computes the optimal unroll factor and determines the most appropriate transformation (which can be the combination of unrolling plus shifting or either of the two). This method is based on profiling information about the kernel's execution times on GPP and FPGA, memory transfers and area utilization. In the experimental part, we apply this method to several kernels from loop nests extracted from real-life applications (DCT and SAD from MPEG2 encoder, Quantizer from JPEG, and Sobel's Convolution) and perform an analysis of the results, comparing them with the theoretical maximum speedup by Amdahl's Law and showing when and how our transformations are beneficial.
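
The loop shifting described in this abstract can be illustrated with a minimal sketch. It assumes a hypothetical loop in which a software function feeds a kernel in every iteration; the names `sw_stage` and `kernel` and their bodies are placeholders chosen for illustration, not code from the article. Shifting moves the software call of iteration i+1 into the same loop body as the kernel call of iteration i, so the two calls no longer depend on each other and could in principle run concurrently on the GPP and the FPGA. The sketch runs them sequentially and only checks that the transformation preserves the results.

```python
def sw_stage(i):
    """Hypothetical software function (would run on the GPP)."""
    return 3 * i + 1


def kernel(x):
    """Hypothetical kernel (would run on the FPGA)."""
    return x * x


def original_loop(n):
    """Each iteration runs the software stage, then the kernel on its output."""
    out = []
    for i in range(n):
        a = sw_stage(i)          # kernel(a) has to wait for this value
        out.append(kernel(a))
    return out


def shifted_loop(n):
    """Same computation after shifting the software stage by one iteration."""
    out = []
    if n == 0:
        return out
    a = sw_stage(0)              # prologue: first software call
    for i in range(n - 1):
        # The two calls below do not depend on each other, so a real
        # system could overlap them; here they simply run in sequence.
        result = kernel(a)       # kernel for iteration i
        next_a = sw_stage(i + 1) # software stage for iteration i + 1
        out.append(result)
        a = next_a
    out.append(kernel(a))        # epilogue: last kernel call
    return out
```

The prologue and epilogue are the cost of the transformation: one software call is peeled off before the loop and one kernel call after it, and the loop itself runs one iteration fewer.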

“…The work in Gupta et al [2004] is part of the SPARK project and uses shifting to expose loop parallelism and then to compact the loop by scheduling multiple operations to execute in parallel. In that case, loop shifting is performed at low level, whereas we perform it at a high functional level.…”

Section: Background and Related Work (mentioning)

confidence: 99%

“…High-level synthesis tools attempt to overcome this limitation by statically unrolling, flattening and pipelining loops in order to decrease the number of backwards branches that would be dynamically executed [14], but this can significantly increase the complexity of the centralized finite-state machines that implement the static schedule for the hardware datapath, resulting in very long combinational paths that can overwhelm any gains in IPC [15], [16]. Overcoming the performance limitations due to explicit control flow is the key issue that needs to be addressed for custom hardware to become performance-competitive with conventional processors on sequential code.…”

Section: The Superscalar Performance Advantage (mentioning)

confidence: 99%

A New Dataflow Compiler IR for Accelerating Control-Intensive Code in Spatial Hardware

Zaidi; Greaves

2014

2014 IEEE International Parallel & Distributed Processing Symposium Workshops

While custom (and reconfigurable) computing can provide orders-of-magnitude improvements in energy efficiency and performance for many numeric, data-parallel applications, performance on non-numeric, sequential code is often worse than what is achievable using conventional superscalar processors. This work attempts to address the problem of improving sequential performance in custom hardware by (a) switching from a statically scheduled to a dynamically scheduled (dataflow) execution model, and (b) developing a new compiler IR for high-level synthesis that enables aggressive exposition of ILP even in the presence of complex control flow. This new IR is directly implemented as a static dataflow graph in hardware by our prototype high-level synthesis tool-chain, and shows an average speedup of 1.13× over equivalent hardware generated using LegUp, an existing HLS tool. In addition, our new IR allows us to further trade area and energy for performance, increasing the average speedup to 1.55×, through loop unrolling, with a peak speedup of 4.05×. Our custom hardware is able to approach the sequential cycle counts of an Intel Nehalem Core i7 superscalar processor, while consuming on average only 0.25× the energy of an in-order Altera Nios IIf processor.

“…However, implementing complex algorithms in FPGA-based systems can be a laborious work. This study is still realized by circuit-vendor specific tools in many cases and requires deep design skills, so it remains the most time consuming operation in a design flow (Gupta et al, 2004; Paiz et al, 2008).…”

Section: Introduction (mentioning)

confidence: 99%

Design of Field Programmable Gate Array Based Emulators for Motor Control Applications

Taha

2012

American Journal of Applied Sciences

Problem statement: Field Programmable Gate Array (FPGA) circuits play a significant role in major recent embedded process control designs. However, exploiting these platforms requires deep hardware conception skills and remains an important time consuming stage in a design flow. High Level Synthesis technique avoids this bottleneck and increases design productivity as witnessed by industry specialists. Approach: This study proposes to apply this technique for the conception and implementation of a Real Time Direct Current Machine (RTDCM) emulator for an embedded control application. Results: Several FPGA-based configuration scenarios are studied. A series of tests including design and timing-precision analysis were conducted to discuss and validate the obtained hardware architectures. Conclusion/Recommendations: The proposed methodology has accelerated the design time besides it has provided an extra time to refine the hardware core of the DCM emulator. The high level synthesis technique can be applied to the control field especially to test with low cost and short delays newest algorithms and motor models.
