2013
DOI: 10.1007/s11227-013-1049-x

A methodology for speeding up edge and line detection algorithms focusing on memory architecture utilization

Abstract: In this paper, a new methodology for speeding up edge and line detection algorithms is presented, achieving improved performance over the state-of-the-art software library OpenCV (speedup from 1.35 up to 2.22) and other conventional implementations, in both general and embedded processors, by reducing the number of load/store and arithmetic instructions, the number of data cache accesses and data cache misses in memory hierarchy and the algorithm memory size. This is achieved by fully exploiting the combinatio…

Citations: cited by 5 publications (6 citation statements)
References: 46 publications
“…As it was expected, one level of loop tiling is not performance efficient for Gaussian Blur, Sobel and Jacobi Stencil since the locality advantage is lost by the additional addressing (tiling adds more loops) and load/store instructions (there are overlapping array elements which are loaded twice [55]). Regarding Gaussian Elimination, loop tiling is not performance efficient because the loops allowed to be tiled (data dependencies) a) do not have fixed bound values (data reuse is decreased in each iteration), b) the upper row of the matrix (which is reused many times) always fits in L1.…”
Section: Results (mentioning)
confidence: 82%
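The overlap cost named in the statement above (array elements loaded twice once a stencil loop is tiled) can be illustrated with a minimal sketch. It is not taken from the cited papers: the function name `stencil_loads`, the 1-D stencil, and the load model (each tile loads its input window, including the halo, exactly once) are assumptions made for illustration only.

```python
def stencil_loads(n, radius, tile=None):
    """Count how often each input element is loaded by a 1-D stencil.

    Hypothetical load model: every tile loads its whole input window
    (its outputs plus `radius` halo elements on each side) exactly once.
    """
    counts = [0] * n
    first, last = radius, n - radius      # valid output indices: [first, last)
    if last <= first:
        return counts                     # input too small for any output
    if tile is None:
        tile = last - first               # one tile covering everything = untiled loop
    for start in range(first, last, tile):
        end = min(start + tile, last)
        # Input window needed to produce outputs start .. end-1
        for i in range(start - radius, end + radius):
            counts[i] += 1
    return counts


untiled = stencil_loads(n=18, radius=1)          # 3-point stencil, no tiling
tiled = stencil_loads(n=18, radius=1, tile=4)    # same stencil, tiles of 4 outputs

print("untiled loads:", sum(untiled))
print("tiled loads:  ", sum(tiled))
print("loaded twice: ", [i for i, c in enumerate(tiled) if c == 2])
```

Under this model the extra loads grow with the number of tile boundaries and with the stencil radius, which is consistent with the statement's point that a single level of tiling can cost more in load/store instructions than it gains in locality for small stencils.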
“…As it was expected, one level of loop tiling is not performance efficient for Gaussian Blur, Sobel and Jacobi Stencil since the locality advantage is lost by the additional addressing (tiling adds more loops) and load/store instructions (there are overlapping array elements which are loaded twice [55]). Regarding Gaussian Elimination, loop tiling is not performance efficient because the loops allowed to be tiled (data dependencies) a) do not have fixed bound values (data reuse is decreased in each iteration), b) the upper row of the matrix (which is reused many times) always fits in L1.…”
Section: Results (mentioning)
confidence: 82%
“…In [23], authors study the vectorization process in CNNs, using Matlab code. Last, in [24], an implementation for canny edge detection algorithm is delivered. The proposed method achieves fewer L/S and arithmetical vector instructions than [17][18][19][20][21][22][23][24] for the three reasons explained above.…”
Section: Related Work (mentioning)
confidence: 99%
“…//Multiply by the mask (16 16-bit results) 18: 2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28 Pack the 16-bit IRs (lines 29-45). out_odd contains 15 16-bit IRs of the output pixels 1, 3, 5, 7, 9, 11, 13, 15, 17, 19, 21, 23, 25, 27, 29. Vector Division (lines 48-50).…”
Section: Vectorization (mentioning)
confidence: 99%
“…A comparison with the above libraries would be unfair because they use the SIMD (Single Instruction Multiple Data) vector instructions (they support load/store and arithmetical instructions with 128/256-bit data); however, our future work includes the support of SIMD instructions. In [29] [30] [31] [32], we have developed algorithm specific methodologies (we used the SIMD instructions), which produce lower execution time, lower compilation time and lower number of data accesses, than ATLAS [29] [30], FFTW [30] and OpenCV [32]. A comparison between the proposed methodology and [29] [30], is made in Section 4.…”
Section: Related Work (mentioning)
confidence: 99%
“…The proposed methodology cannot be compared with [31] because FFT contains nonlinear subscript equations (see second paragraph of Section 3). Also, the proposed methodology is not compared with [32] (Canny algorithm); this is because in [32], the four Canny kernels are optimized together and instead of four, one output loop kernel is produced. The proposed methodology optimizes each loop kernel separately and thus it cannot produce the schedules discussed in [32].…”
Section: Comparison With Iterative Compilation and Other Related Work (mentioning)
confidence: 99%