Abstract: In this paper, a new methodology for speeding up edge and line detection algorithms is presented, achieving improved performance over the state-of-the-art software library OpenCV (speedups from 1.35 up to 2.22) and other conventional implementations, on both general-purpose and embedded processors, by reducing the number of load/store and arithmetic instructions, the number of data cache accesses and data cache misses in the memory hierarchy, and the algorithm's memory size. This is achieved by fully exploiting the combinatio…
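The load/store reduction the abstract refers to comes largely from keeping reused neighbourhood pixels in registers. The following is a minimal, hedged sketch of that idea (not the paper's code) for a 3×3 Sobel-x pass; function and variable names are illustrative.

/* Hedged sketch (not the paper's implementation): register-level data reuse
 * for a 3x3 Sobel-x pass. A moving window keeps the previous two image
 * columns in scalars, so each output pixel loads only 3 new bytes instead of
 * re-loading the whole neighbourhood. */
#include <stdint.h>

void sobel_x(const uint8_t *in, int16_t *out, int width, int height)
{
    for (int y = 1; y < height - 1; y++) {
        /* preload the first two columns of the 3x3 window */
        int c0t = in[(y - 1) * width],     c0m = in[y * width],     c0b = in[(y + 1) * width];
        int c1t = in[(y - 1) * width + 1], c1m = in[y * width + 1], c1b = in[(y + 1) * width + 1];
        for (int x = 1; x < width - 1; x++) {
            /* only the rightmost column is loaded from memory */
            int c2t = in[(y - 1) * width + x + 1];
            int c2m = in[y * width + x + 1];
            int c2b = in[(y + 1) * width + x + 1];
            out[y * width + x] = (int16_t)((c2t + 2 * c2m + c2b) - (c0t + 2 * c0m + c0b));
            /* rotate the window: column 1 becomes column 0, column 2 becomes column 1 */
            c0t = c1t; c0m = c1m; c0b = c1b;
            c1t = c2t; c1m = c2m; c1b = c2b;
        }
    }
}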
“…As expected, one level of loop tiling is not performance efficient for Gaussian Blur, Sobel and Jacobi Stencil, since the locality advantage is lost to the additional addressing (tiling adds more loops) and load/store instructions (overlapping array elements are loaded twice [55]). Regarding Gaussian Elimination, loop tiling is not performance efficient because the loops that may legally be tiled (due to data dependencies) a) do not have fixed bound values (data reuse decreases in each iteration), and b) the upper row of the matrix (which is reused many times) always fits in L1.…”
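The overhead the quote describes is easy to see in code. Below is an illustrative sketch (not the cited implementation) of a one-level-tiled blur-style stencil: the two extra tile loops add addressing work, and the one-pixel halo on each tile border is read again by the neighbouring tile; TY and TX are hypothetical tile sizes.

/* Illustrative sketch (not the cited code): one level of loop tiling on a
 * 5-point blur-style stencil. The tiled form adds two loops (extra addressing
 * and loop overhead) and re-loads the one-pixel halo shared by adjacent
 * tiles, which is why tiling can lose more than it gains for such kernels. */
#define TY 64
#define TX 64

void blur_tiled(const float *in, float *out, int n)
{
    for (int yy = 1; yy < n - 1; yy += TY)          /* extra loop */
        for (int xx = 1; xx < n - 1; xx += TX)      /* extra loop */
            for (int y = yy; y < (yy + TY < n - 1 ? yy + TY : n - 1); y++)
                for (int x = xx; x < (xx + TX < n - 1 ? xx + TX : n - 1); x++)
                    /* pixels on the tile border (the halo) are also read by
                       the adjacent tile, i.e. they are loaded twice */
                    out[y * n + x] = (in[(y - 1) * n + x] + in[(y + 1) * n + x] +
                                      in[y * n + x - 1] + in[y * n + x + 1] +
                                      in[y * n + x]) / 5.0f;
}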
Today's compilers have a plethora of optimizations (transformations) to choose from, and the correct choice, order and parameters of these transformations have a large impact on performance; choosing the correct order and parameters of optimizations has been a long-standing problem in compilation research which remains unsolved to this day. Optimizing the sub-problems separately gives a different schedule/binary for each sub-problem, and these schedules cannot coexist, as refining one degrades another. Researchers try to address this problem with iterative compilation techniques, but the search space is so large that it cannot be searched even with modern supercomputers. Moreover, compiler transformations do not take the hardware architecture details and data reuse into account in an efficient way.

In this paper, a new iterative compilation methodology is presented which reduces the search space of six compiler transformations by addressing the above problems; the search space is reduced by many orders of magnitude, so an efficient solution can now be found. The transformations are: loop tiling (including the number of tiling levels), loop unroll, register allocation, scalar replacement, loop interchange and data array layouts. The search space is reduced a) by addressing the aforementioned transformations together as one problem rather than separately, and b) by taking into account the hardware architecture details (e.g., cache size and associativity) and algorithm characteristics (e.g., data reuse).

The proposed methodology has been evaluated against iterative compilation and the gcc/icc compilers, on both embedded and general-purpose processors; it achieves significant performance gains at many orders of magnitude lower compilation time.
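To make the joint search space concrete, the sketch below (not the methodology's actual output) shows one candidate schedule for matrix-matrix multiplication in which loop interchange, one level of loop tiling, loop unrolling and scalar replacement are applied together; the tile size and unroll factor are hypothetical parameters that such a methodology would derive from the cache and register-file characteristics rather than fix by hand.

/* Hedged sketch: one candidate point in a joint transformation space for
 * C = A*B. Loop interchange (i-k-j), one level of tiling on j with tile size
 * TJ, unrolling by 2 on j, and scalar replacement of A[i][k] are applied
 * together. N and TJ are hypothetical and chosen so that N is a multiple of
 * TJ and TJ is even. */
#define N  512
#define TJ 64

void mmm(const double A[N][N], const double B[N][N], double C[N][N])
{
    for (int jj = 0; jj < N; jj += TJ)                    /* loop tiling            */
        for (int i = 0; i < N; i++)
            for (int k = 0; k < N; k++) {                 /* loop interchange (ikj) */
                double a = A[i][k];                       /* scalar replacement     */
                for (int j = jj; j < jj + TJ; j += 2) {   /* unroll by 2            */
                    C[i][j]     += a * B[k][j];
                    C[i][j + 1] += a * B[k][j + 1];
                }
            }
}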
“…In [23], the authors study the vectorization process in CNNs using Matlab code. Last, in [24], an implementation of the Canny edge detection algorithm is delivered. The proposed method achieves fewer L/S and arithmetic vector instructions than [17][18][19][20][21][22][23][24] for the three reasons explained above.…”
Section: Related Work (mentioning, confidence: 99%)
“…[Excerpt from a vectorized convolution listing:] Multiply by the mask, producing 16 16-bit results (line 18); these 16-bit intermediate results (IRs) correspond to the even output pixels 2, 4, 6, …, 28. Pack the 16-bit IRs (lines 29-45); out_odd contains the 15 16-bit IRs of the odd output pixels 1, 3, 5, …, 29. Vector division (lines 48-50).…”
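The three steps named in the excerpt (multiply by the mask into 16-bit intermediate results, pack, vector division) can be illustrated with AVX2 intrinsics. The sketch below is not the cited listing; it applies the same steps to a hypothetical 1×3 blur with mask {1, 2, 1}, whose mask sum of 4 lets the division become a shift.

/* Hedged sketch (not the cited listing): widen 8-bit pixels to 16-bit
 * intermediate results (IRs), multiply/accumulate by the mask {1,2,1},
 * "divide" by the mask sum (4) with a shift, and pack back to 8 bits,
 * 16 output pixels per iteration. */
#include <immintrin.h>
#include <stdint.h>

void blur1x3_row(const uint8_t *in, uint8_t *out, int width)
{
    const __m256i two = _mm256_set1_epi16(2);
    for (int x = 0; x + 17 < width; x += 16) {
        /* widen 16 pixels of each tap to 16-bit IRs */
        __m256i left   = _mm256_cvtepu8_epi16(_mm_loadu_si128((const __m128i *)(in + x)));
        __m256i center = _mm256_cvtepu8_epi16(_mm_loadu_si128((const __m128i *)(in + x + 1)));
        __m256i right  = _mm256_cvtepu8_epi16(_mm_loadu_si128((const __m128i *)(in + x + 2)));

        /* multiply by the mask and accumulate: left + 2*center + right */
        __m256i ir = _mm256_add_epi16(_mm256_add_epi16(left, right),
                                      _mm256_mullo_epi16(center, two));

        /* vector "division" by the mask sum (4) as a right shift */
        ir = _mm256_srli_epi16(ir, 2);

        /* pack the 16-bit IRs back to 8-bit output pixels */
        __m128i lo = _mm256_castsi256_si128(ir);
        __m128i hi = _mm256_extracti128_si256(ir, 1);
        _mm_storeu_si128((__m128i *)(out + x + 1), _mm_packus_epi16(lo, hi));
    }
}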
In this paper, a new method for accelerating the 2D direct convolution operation on x86/x64 processors is presented. It combines efficient vectorization using SIMD intrinsics, bit-twiddling optimizations, an optimized division operation, multi-threading with OpenMP, register blocking, and the shortest possible bit-width for the intermediate results. The proposed method, which is provided as open source, is general and can be applied to other processor families too, e.g., Arm. The proposed method has been evaluated on two different multi-core Intel CPUs, using twenty different image sizes, 8-bit integer computations and the most commonly used kernel sizes (3x3, 5x5, 7x7, 9x9). It achieves from 2.8× to 40× speedup over the Intel IPP library (OpenCV GaussianBlur and Filter2D routines), from 105× to 400× speedup over the gemm-based convolution method (using the Intel MKL int8 matrix multiplication routine), and from 8.5× to 618× speedup over the vslsConvExec Intel MKL direct convolution routine. The proposed method is superior as it executes far fewer arithmetic and load/store instructions.
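One of the listed ingredients, the optimization of the division operation, is typically done by replacing the per-pixel integer division by the mask sum with a multiply and a shift. The snippet below is a hedged scalar sketch of that idea, not the paper's exact code; the constants assume a hypothetical 3×3 box filter whose mask sum is 9.

/* Hedged sketch of the division optimization: 7282 = ceil(2^16 / 9), and
 * (x * 7282) >> 16 equals x / 9 exactly for every value x up to 32767.
 * A 3x3 sum of 8-bit pixels is at most 9 * 255 = 2295, so the result is
 * always exact here and no div instruction is issued. */
#include <stdint.h>

static inline uint8_t div_by_mask_sum(uint32_t ir)
{
    return (uint8_t)((ir * 7282u) >> 16);   /* ir / 9 without a division */
}

For other mask sums (e.g., the 16 of a 3×3 Gaussian kernel) the division degenerates to a plain shift, which is the cheapest case of the same idea.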
“…A comparison with the above libraries would be unfair because they use SIMD (Single Instruction Multiple Data) vector instructions (they support load/store and arithmetic instructions with 128/256-bit data); however, our future work includes support for SIMD instructions. In [29] [30] [31] [32], we have developed algorithm-specific methodologies (using SIMD instructions) which produce lower execution time, lower compilation time and fewer data accesses than ATLAS [29] [30], FFTW [30] and OpenCV [32]. A comparison between the proposed methodology and [29] [30] is made in Section 4.…”
Section: Related Work (mentioning, confidence: 99%)
“…The proposed methodology cannot be compared with [31] because FFT contains non-linear subscript equations (see the second paragraph of Section 3). Also, the proposed methodology is not compared with [32] (Canny algorithm); this is because in [32] the four Canny kernels are optimized together and a single output loop kernel is produced instead of four. The proposed methodology optimizes each loop kernel separately and thus cannot produce the schedules discussed in [32].…”
Section: Comparison With Iterative Compilation and Other Related Work (mentioning)
It is well known that today's compilers and state-of-the-art libraries have three major drawbacks. First, the compiler sub-problems are optimized separately; this is not efficient because optimizing the sub-problems separately gives a different schedule for each sub-problem, and these schedules cannot coexist, as refining one causes the degradation of another. Second, they take into account only part of the specific algorithm's information. Third, they take into account only a few hardware architecture parameters. These approaches cannot give an optimal solution.

In this paper, a new methodology/pre-compiler is introduced which speeds up loop kernels by overcoming the above problems. This methodology solves four of the major scheduling sub-problems together as one problem, not separately; these are the sub-problems of finding the schedules with the minimum numbers of (i) L1 data cache accesses, (ii) L2 data cache accesses, (iii) main memory data accesses, and (iv) addressing instructions. First, the exploration space (the possible solutions) is found according to the algorithm's information, e.g., array subscripts. Then, the exploration space is decreased by orders of magnitude by applying constraint propagation to the software and hardware parameters.

We take the C code and the memory architecture parameters as input and automatically produce a new, faster C code; this code cannot be obtained by applying the existing compiler transformations to the original code. The proposed methodology has been evaluated for five well-known algorithms on both general-purpose and embedded processors; it is compared with the gcc and clang compilers and also with iterative compilation.
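The constraint-propagation step can be pictured as follows: hardware parameters prune candidate schedules before anything is compiled or run. The sketch below is illustrative only (not the actual pre-compiler) and uses hypothetical L1 parameters (32 KiB, 64-byte lines, 8-way) to discard square tile sizes whose working set cannot reside in the L1 data cache.

/* Illustrative sketch of constraint propagation: candidate square tile sizes
 * for a blocked kernel using three arrays are kept only if the combined tile
 * working set fits in L1 without exceeding its associativity. All cache
 * parameters are hypothetical. */
#include <stdio.h>

#define L1_SIZE    (32 * 1024)
#define L1_LINE    64
#define L1_ASSOC   8
#define ELEM       sizeof(double)
#define NUM_ARRAYS 3            /* e.g., A, B and C tiles live in L1 together */

int main(void)
{
    int sets = L1_SIZE / (L1_LINE * L1_ASSOC);
    for (int t = 8; t <= 256; t *= 2) {
        long tile_bytes     = (long)t * t * ELEM;                  /* one tile  */
        long lines_per_tile = (tile_bytes + L1_LINE - 1) / L1_LINE;
        long ways_needed    = (lines_per_tile + sets - 1) / sets;  /* per array */
        if (NUM_ARRAYS * tile_bytes <= L1_SIZE &&
            NUM_ARRAYS * ways_needed <= L1_ASSOC)
            printf("tile %dx%d kept (working set %ld bytes)\n", t, t,
                   NUM_ARRAYS * tile_bytes);
        else
            printf("tile %dx%d pruned\n", t, t);
    }
    return 0;
}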