Loop unrolling and shifting for reconfigurable architectures

Dragomir, Ozana Silvia; Stefanov, Todor; Bertels, Koen

doi:10.1109/fpl.2008.4629926

Cited by 6 publications

(7 citation statements)

References 10 publications

“…The proposed joint loop transformation in consideration communication cost, PE utilization rate and configuration cost (Joint PE+COM+CFG) is compared with two reference points. The first reference point is the loop unrolling based optimization scheme [2], where all the loops are unrolled and converted into DFGs. Therefor, the regularity of original code is disarranged and optimization is performed on the generated DFG.…”

Section: Experiments Results (mentioning)

confidence: 99%

“…However, our approach performs better than the loop unrolling based approaches [2] in all the three example cases, where the execution performance of 1-d JACOBI, ME and PDE solvers are improved by 28.3%, 25.6% and 36.7%, respectively. Subsequently, we focus on the performance of our proposed approach and combined PE+COM approach [5] on the ME and PDE solver kernels, where the performance of our proposed approach is better than that of combined PE+COM [5] approach.…”

Section: Fig (mentioning)

confidence: 98%

“…Since only a single iteration is analyzed, limited parallelism could be achieved. Loop unrolling [2], [3] is a common technique to generate a mapping scheme with greater parallelism. It unrolls a loop and transfer it into a DFG.…”

Section: Introduction (mentioning)

confidence: 99%

Affine Transformations for Communication and Reconfiguration Optimization of Mapping Loop Nests on CGRAs

Yin, Liu, et al. 2013. IEICE Trans. Inf. & Syst.

SUMMARY: A coarse-grained reconfigurable architecture (CGRA) is typically hybrid architecture, which is composed of a reconfigurable processing unit (RPU) and a host microprocessor. Many computation-intensive kernels (e.g., loop nests) are often mapped onto RPUs to speed up the execution of programs. Thus, mapping optimization of loop nests is very important to improve the performance of CGRA. Processing element (PE) utilization rate, communication volume and reconfiguration cost are three crucial factors for the performance of RPUs. Loop transformations can affect these three performance influencing factors greatly, and would be of much significance when mapping loops onto RPUs. In this paper, a joint loop transformation approach for RPUs is proposed, where the PE utilization rate, communication cost and reconfiguration cost are under a joint consideration. Our approach could be integrated into compilers for CGRAs to improve the operating performance. Compared with the communication-minimal approach, experimental results show that our scheme can improve 5.8% and 13.6% of execution time on motion estimation (ME) and partial differential equation (PDE) solvers kernels, respectively. Also, run-time complexity is acceptable for the practical cases.

“…This article extends our previous work on loop unrolling [Dragomir et al 2008a] and loop unrolling plus shifting [Dragomir et al 2008b]. In the following section we will present the methodology for choosing the more suitable of the two transformations and the optimal unroll factor (which may be 1, if only loop shifting is used).…”

Section: Background and Related Work (mentioning)

confidence: 96%

Optimal Loop Unrolling and Shifting for Reconfigurable Architectures

Dragomir, Stefanov, Bertels. 2009. ACM Trans. Reconfigurable Technol. Syst. (Self Cite)

In this article, we present a new technique for optimizing loops that contain kernels mapped on a reconfigurable fabric. We assume the Molen machine organization as our framework. We propose combining loop unrolling with loop shifting, which is used to relocate the function calls contained in the loop body such that in every iteration of the transformed loop, software functions (running on GPP) execute in parallel with multiple instances of the kernel (running on FPGA). The algorithm computes the optimal unroll factor and determines the most appropriate transformation (which can be the combination of unrolling plus shifting or either of the two). This method is based on profiling information about the kernel's execution times on GPP and FPGA, memory transfers and area utilization. In the experimental part, we apply this method to several kernels from loop nests extracted from real-life applications (DCT and SAD from MPEG2 encoder, Quantizer from JPEG, and Sobel's Convolution) and perform an analysis of the results, comparing them with the theoretical maximum speedup by Amdahl's Law and showing when and how our transformations are beneficial.
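
The unrolling-plus-shifting idea in the abstract above can be illustrated with a small sketch. This is not the authors' algorithm: it does not compute an optimal unroll factor, and it runs sequentially. The names `kernel`, `sw_post`, `original`, `unrolled_and_shifted`, and the unroll factor `u` are hypothetical stand-ins chosen for illustration. The sketch only shows the shape of the transformed loop, in which the kernel calls for one batch no longer depend on the software calls scheduled alongside them, so a real system could overlap them.

```python
def kernel(x):
    # Hypothetical stand-in for the hardware kernel (would run on the FPGA).
    return x * x


def sw_post(y):
    # Hypothetical stand-in for a software function that consumes the
    # kernel's output (would run on the GPP).
    return y + 1


def original(data):
    # Original loop: each iteration runs the kernel, then the software function.
    out = []
    for x in data:
        out.append(sw_post(kernel(x)))
    return out


def unrolled_and_shifted(data, u):
    # Unrolling by u: each transformed iteration handles a batch of up to u items.
    batches = [data[i:i + u] for i in range(0, len(data), u)]
    out = []
    if not batches:
        return out

    # Prologue: kernel instances for the first batch only.
    pending = [kernel(x) for x in batches[0]]

    # Steady state: the kernels for the current batch and the software calls
    # for the previous batch are independent, so they could run in parallel.
    for batch in batches[1:]:
        current = [kernel(x) for x in batch]
        out.extend(sw_post(y) for y in pending)
        pending = current

    # Epilogue: software calls for the last batch.
    out.extend(sw_post(y) for y in pending)
    return out
```

With `u = 1` the transformation reduces to loop shifting alone, which matches the abstract's remark that the chosen transformation may be either of the two or their combination.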

“…The static mappers enjoy relaxed processing deadlines. The relaxed processing deadlines allow them to execute complex algorithms such as modulo scheduling [12] [13] and affine loop transformation [14] to efficiently exploit parallelism [15] [16]. Although they find optimal mappings, the compile-time decisions are unable to efficiently cope with unpredictable scenarios found in many real world applications.…”

Section: Related Work and Contributions (mentioning)

confidence: 99%

RuRot: Run-time rotatable-expandable partitions for efficient mapping in CGRAs

Jafri, Serrano, Iqbal, et al. 2014. 2014 International Conference on Embedded Computer Systems: Architectures, Modeling, and Simulation (SAMOS XIV)

Today, Coarse Grained Reconfigurable Architectures (CGRAs) host multiple applications, with arbitrary communication and computation patterns. Compile-time mapping decisions are neither optimal nor desirable to efficiently support the diverse and unpredictable application requirements. As a solution to this problem, recently proposed architectures offer run-time remapping. The run-time remappers displace or expand (parallelize/serialize) an application to optimize different parameters (such as platform utilization). However, the existing remappers support application displacement or expansion in either horizontal or vertical direction. Moreover, most of the works only address dynamic remapping in packet-switched networks and therefore are not applicable to the CGRAs that exploit circuit-switching for low-power and high predictability. To enhance the optimality of the run-time remappers, this paper presents a design framework called Run-time Rotatable-expandable Partitions (RuRot). RuRot provides architectural support to dynamically remap or expand (i.e. parallelize) the hosted applications in CGRAs with circuit-switched interconnects. Compared to state of the art, the proposed design supports application rotation (in clockwise and anticlockwise directions) and displacement (in horizontal and vertical directions), at run-time. Simulation results using a few applications reveal that the additional flexibility enhances the device utilization, significantly (on average 50 % for the tested applications). Synthesis results confirm that the proposed remapper has negligible silicon (0.2 % of the platform) and timing (2 cycles per application) overheads.
