Abstract: In this paper, a new methodology for speeding up edge and line detection algorithms is presented, achieving improved performance over the state-of-the-art software library OpenCV (speedups from 1.35 up to 2.22) and other conventional implementations, on both general-purpose and embedded processors, by reducing the number of load/store and arithmetic instructions, the number of data cache accesses and data cache misses in the memory hierarchy, and the algorithm's memory size. This is achieved by fully exploiting the combinatio…
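The load/store reduction the abstract refers to comes largely from keeping reused neighbourhood pixels in registers. The following is a minimal, hedged sketch of that idea (not the paper's code) for a 3×3 Sobel-x pass; function and variable names are illustrative.

/* Hedged sketch (not the paper's implementation): register-level data reuse
 * for a 3x3 Sobel-x pass. A moving window keeps the previous two image
 * columns in scalars, so each output pixel loads only 3 new bytes instead of
 * re-loading the whole neighbourhood. */
#include <stdint.h>

void sobel_x(const uint8_t *in, int16_t *out, int width, int height)
{
    for (int y = 1; y < height - 1; y++) {
        /* preload the first two columns of the 3x3 window */
        int c0t = in[(y - 1) * width],     c0m = in[y * width],     c0b = in[(y + 1) * width];
        int c1t = in[(y - 1) * width + 1], c1m = in[y * width + 1], c1b = in[(y + 1) * width + 1];
        for (int x = 1; x < width - 1; x++) {
            /* only the rightmost column is loaded from memory */
            int c2t = in[(y - 1) * width + x + 1];
            int c2m = in[y * width + x + 1];
            int c2b = in[(y + 1) * width + x + 1];
            out[y * width + x] = (int16_t)((c2t + 2 * c2m + c2b) - (c0t + 2 * c0m + c0b));
            /* rotate the window: column 1 becomes column 0, column 2 becomes column 1 */
            c0t = c1t; c0m = c1m; c0b = c1b;
            c1t = c2t; c1m = c2m; c1b = c2b;
        }
    }
}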
“…As expected, one level of loop tiling is not performance efficient for Gaussian Blur, Sobel and Jacobi Stencil, since the locality advantage is lost to the additional addressing (tiling adds more loops) and load/store instructions (overlapping array elements are loaded twice [55]). Regarding Gaussian Elimination, loop tiling is not performance efficient because the loops that may legally be tiled (due to data dependencies) a) do not have fixed bound values (data reuse decreases in each iteration), and b) the upper row of the matrix (which is reused many times) always fits in L1.…”
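The overhead the quote describes is easy to see in code. Below is an illustrative sketch (not the cited implementation) of a one-level-tiled blur-style stencil: the two extra tile loops add addressing work, and the one-pixel halo on each tile border is read again by the neighbouring tile; TY and TX are hypothetical tile sizes.

/* Illustrative sketch (not the cited code): one level of loop tiling on a
 * 5-point blur-style stencil. The tiled form adds two loops (extra addressing
 * and loop overhead) and re-loads the one-pixel halo shared by adjacent
 * tiles, which is why tiling can lose more than it gains for such kernels. */
#define TY 64
#define TX 64

void blur_tiled(const float *in, float *out, int n)
{
    for (int yy = 1; yy < n - 1; yy += TY)          /* extra loop */
        for (int xx = 1; xx < n - 1; xx += TX)      /* extra loop */
            for (int y = yy; y < (yy + TY < n - 1 ? yy + TY : n - 1); y++)
                for (int x = xx; x < (xx + TX < n - 1 ? xx + TX : n - 1); x++)
                    /* pixels on the tile border (the halo) are also read by
                       the adjacent tile, i.e. they are loaded twice */
                    out[y * n + x] = (in[(y - 1) * n + x] + in[(y + 1) * n + x] +
                                      in[y * n + x - 1] + in[y * n + x + 1] +
                                      in[y * n + x]) / 5.0f;
}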
Today's compilers have a plethora of optimizations (transformations) to choose from, and the correct choice, order and parameters of these transformations have a large impact on performance; choosing the correct order and parameters of optimizations has been a long-standing problem in compilation research which remains unsolved to this day. Optimizing the sub-problems separately gives a different schedule/binary for each sub-problem, and these schedules cannot coexist, as refining one degrades another. Researchers try to address this problem with iterative compilation techniques, but the search space is so large that it cannot be searched even with modern supercomputers. Moreover, compiler transformations do not take the hardware architecture details and data reuse into account in an efficient way.

In this paper, a new iterative compilation methodology is presented which reduces the search space of six compiler transformations by addressing the above problems; the search space is reduced by many orders of magnitude, so an efficient solution can now be found. The transformations are: loop tiling (including the number of tiling levels), loop unroll, register allocation, scalar replacement, loop interchange and data array layouts. The search space is reduced a) by addressing the aforementioned transformations together as one problem rather than separately, and b) by taking into account the hardware architecture details (e.g., cache size and associativity) and algorithm characteristics (e.g., data reuse).

The proposed methodology has been evaluated against iterative compilation and the gcc/icc compilers, on both embedded and general-purpose processors; it achieves significant performance gains at many orders of magnitude lower compilation time.
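To make the joint search space concrete, the sketch below (not the methodology's actual output) shows one candidate schedule for matrix-matrix multiplication in which loop interchange, one level of loop tiling, loop unrolling and scalar replacement are applied together; the tile size and unroll factor are hypothetical parameters that such a methodology would derive from the cache and register-file characteristics rather than fix by hand.

/* Hedged sketch: one candidate point in a joint transformation space for
 * C = A*B. Loop interchange (i-k-j), one level of tiling on j with tile size
 * TJ, unrolling by 2 on j, and scalar replacement of A[i][k] are applied
 * together. N and TJ are hypothetical and chosen so that N is a multiple of
 * TJ and TJ is even. */
#define N  512
#define TJ 64

void mmm(const double A[N][N], const double B[N][N], double C[N][N])
{
    for (int jj = 0; jj < N; jj += TJ)                    /* loop tiling            */
        for (int i = 0; i < N; i++)
            for (int k = 0; k < N; k++) {                 /* loop interchange (ikj) */
                double a = A[i][k];                       /* scalar replacement     */
                for (int j = jj; j < jj + TJ; j += 2) {   /* unroll by 2            */
                    C[i][j]     += a * B[k][j];
                    C[i][j + 1] += a * B[k][j + 1];
                }
            }
}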
“…In [23], the authors study the vectorization process in CNNs using Matlab code. Last, in [24], an implementation of the Canny edge detection algorithm is delivered. The proposed method achieves fewer L/S and arithmetic vector instructions than [17][18][19][20][21][22][23][24] for the three reasons explained above.…”
Section: Related Work (mentioning, confidence: 99%)
“…[Excerpt from a vectorized convolution listing:] Multiply by the mask, producing 16 16-bit results (line 18); these 16-bit intermediate results (IRs) correspond to the even output pixels 2, 4, 6, …, 28. Pack the 16-bit IRs (lines 29-45); out_odd contains the 15 16-bit IRs of the odd output pixels 1, 3, 5, …, 29. Vector division (lines 48-50).…”
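The three steps named in the excerpt (multiply by the mask into 16-bit intermediate results, pack, vector division) can be illustrated with AVX2 intrinsics. The sketch below is not the cited listing; it applies the same steps to a hypothetical 1×3 blur with mask {1, 2, 1}, whose mask sum of 4 lets the division become a shift.

/* Hedged sketch (not the cited listing): widen 8-bit pixels to 16-bit
 * intermediate results (IRs), multiply/accumulate by the mask {1,2,1},
 * "divide" by the mask sum (4) with a shift, and pack back to 8 bits,
 * 16 output pixels per iteration. */
#include <immintrin.h>
#include <stdint.h>

void blur1x3_row(const uint8_t *in, uint8_t *out, int width)
{
    const __m256i two = _mm256_set1_epi16(2);
    for (int x = 0; x + 17 < width; x += 16) {
        /* widen 16 pixels of each tap to 16-bit IRs */
        __m256i left   = _mm256_cvtepu8_epi16(_mm_loadu_si128((const __m128i *)(in + x)));
        __m256i center = _mm256_cvtepu8_epi16(_mm_loadu_si128((const __m128i *)(in + x + 1)));
        __m256i right  = _mm256_cvtepu8_epi16(_mm_loadu_si128((const __m128i *)(in + x + 2)));

        /* multiply by the mask and accumulate: left + 2*center + right */
        __m256i ir = _mm256_add_epi16(_mm256_add_epi16(left, right),
                                      _mm256_mullo_epi16(center, two));

        /* vector "division" by the mask sum (4) as a right shift */
        ir = _mm256_srli_epi16(ir, 2);

        /* pack the 16-bit IRs back to 8-bit output pixels */
        __m128i lo = _mm256_castsi256_si128(ir);
        __m128i hi = _mm256_extracti128_si256(ir, 1);
        _mm_storeu_si128((__m128i *)(out + x + 1), _mm_packus_epi16(lo, hi));
    }
}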
In this paper, a new method for accelerating the 2D direct convolution operation on x86/x64 processors is presented. It combines efficient vectorization using SIMD intrinsics, bit-twiddling optimizations, an optimized division operation, multi-threading with OpenMP, register blocking, and the shortest possible bit-width for the intermediate results. The proposed method, which is provided as open source, is general and can be applied to other processor families too, e.g., Arm. The proposed method has been evaluated on two different multi-core Intel CPUs, using twenty different image sizes, 8-bit integer computations and the most commonly used kernel sizes (3x3, 5x5, 7x7, 9x9). It achieves from 2.8× to 40× speedup over the Intel IPP library (OpenCV GaussianBlur and Filter2D routines), from 105× to 400× speedup over the gemm-based convolution method (using the Intel MKL int8 matrix multiplication routine), and from 8.5× to 618× speedup over the vslsConvExec Intel MKL direct convolution routine. The proposed method is superior as it executes far fewer arithmetic and load/store instructions.
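One of the listed ingredients, the optimization of the division operation, is typically done by replacing the per-pixel integer division by the mask sum with a multiply and a shift. The snippet below is a hedged scalar sketch of that idea, not the paper's exact code; the constants assume a hypothetical 3×3 box filter whose mask sum is 9.

/* Hedged sketch of the division optimization: 7282 = ceil(2^16 / 9), and
 * (x * 7282) >> 16 equals x / 9 exactly for every value x up to 32767.
 * A 3x3 sum of 8-bit pixels is at most 9 * 255 = 2295, so the result is
 * always exact here and no div instruction is issued. */
#include <stdint.h>

static inline uint8_t div_by_mask_sum(uint32_t ir)
{
    return (uint8_t)((ir * 7282u) >> 16);   /* ir / 9 without a division */
}

For other mask sums (e.g., the 16 of a 3×3 Gaussian kernel) the division degenerates to a plain shift, which is the cheapest case of the same idea.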
“…A comparison with the above libraries would be unfair because they use SIMD (Single Instruction Multiple Data) vector instructions (they support load/store and arithmetic instructions with 128/256-bit data); however, our future work includes support for SIMD instructions. In [29] [30] [31] [32], we have developed algorithm-specific methodologies (using SIMD instructions) which produce lower execution time, lower compilation time and fewer data accesses than ATLAS [29] [30], FFTW [30] and OpenCV [32]. A comparison between the proposed methodology and [29] [30] is made in Section 4.…”
Section: Related Work (mentioning, confidence: 99%)
“…The proposed methodology cannot be compared with [31] because FFT contains non-linear subscript equations (see the second paragraph of Section 3). Also, the proposed methodology is not compared with [32] (Canny algorithm); this is because in [32] the four Canny kernels are optimized together and a single output loop kernel is produced instead of four. The proposed methodology optimizes each loop kernel separately and thus cannot produce the schedules discussed in [32].…”
Section: Comparison With Iterative Compilation and Other Related Work (mentioning)
It is well known that today's compilers and state-of-the-art libraries have three major drawbacks. First, the compiler sub-problems are optimized separately; this is not efficient because optimizing the sub-problems separately gives a different schedule for each sub-problem, and these schedules cannot coexist, as refining one causes the degradation of another. Second, they take into account only part of the specific algorithm's information. Third, they take into account only a few hardware architecture parameters. These approaches cannot give an optimal solution.

In this paper, a new methodology/pre-compiler is introduced which speeds up loop kernels by overcoming the above problems. This methodology solves four of the major scheduling sub-problems together as one problem, not separately; these are the sub-problems of finding the schedules with the minimum numbers of (i) L1 data cache accesses, (ii) L2 data cache accesses, (iii) main memory data accesses, and (iv) addressing instructions. First, the exploration space (the possible solutions) is found according to the algorithm's information, e.g., array subscripts. Then, the exploration space is decreased by orders of magnitude by applying constraint propagation to the software and hardware parameters.

We take the C code and the memory architecture parameters as input and automatically produce a new, faster C code; this code cannot be obtained by applying the existing compiler transformations to the original code. The proposed methodology has been evaluated for five well-known algorithms on both general-purpose and embedded processors; it is compared with the gcc and clang compilers and also with iterative compilation.
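The constraint-propagation step can be pictured as follows: hardware parameters prune candidate schedules before anything is compiled or run. The sketch below is illustrative only (not the actual pre-compiler) and uses hypothetical L1 parameters (32 KiB, 64-byte lines, 8-way) to discard square tile sizes whose working set cannot reside in the L1 data cache.

/* Illustrative sketch of constraint propagation: candidate square tile sizes
 * for a blocked kernel using three arrays are kept only if the combined tile
 * working set fits in L1 without exceeding its associativity. All cache
 * parameters are hypothetical. */
#include <stdio.h>

#define L1_SIZE    (32 * 1024)
#define L1_LINE    64
#define L1_ASSOC   8
#define ELEM       sizeof(double)
#define NUM_ARRAYS 3            /* e.g., A, B and C tiles live in L1 together */

int main(void)
{
    int sets = L1_SIZE / (L1_LINE * L1_ASSOC);
    for (int t = 8; t <= 256; t *= 2) {
        long tile_bytes     = (long)t * t * ELEM;                  /* one tile  */
        long lines_per_tile = (tile_bytes + L1_LINE - 1) / L1_LINE;
        long ways_needed    = (lines_per_tile + sets - 1) / sets;  /* per array */
        if (NUM_ARRAYS * tile_bytes <= L1_SIZE &&
            NUM_ARRAYS * ways_needed <= L1_ASSOC)
            printf("tile %dx%d kept (working set %ld bytes)\n", t, t,
                   NUM_ARRAYS * tile_bytes);
        else
            printf("tile %dx%d pruned\n", t, t);
    }
    return 0;
}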