As mentioned previously, when the number of data elements to be sorted exceeds the number of elements sorted by each execution of the sorting unit, a software-merge algorithm is used. In this latter case, the speed-up degrades (the inflection points in the chart shown in Figure 8). Note that the software merge adds a computational time complexity of O(n · log n), where n is the number of elements to sort. Figure 9 gives estimates for sorting a set of 16K elements with three sorting units. We consider the case in which simultaneous load/store operations are supported for communicating data to the sorting units. The first machine is a completely parallel Sorting Network (SN-II), evaluated in two versions able to directly sort 16 and 32 elements, respectively. The second machine is a 1024-element Insertion Sorting Unit. The third machine is a FIFO-based Merge Sorting Unit able to output 512 sorted elements from two sets of 256 elements, each sorted by an Insertion Sorting Unit. The results take into account typical DMA load/store latencies, obtained from experimental measurements. The execution time for sorting 16K elements includes the overhead of the software merge.
These results indicate that the Sorting Network SN-II with size 16 (SN II 16) performs worse than software quicksort, even with 16 simultaneous load/store operations. The SN-II with size 32 (SN II 32) surpasses quicksort when more than 2 simultaneous load/store operations are available. The Insertion Sort Unit with size 1024 (Insertion 1024) outperforms quicksort in all cases, but since the data is fed to the sorting unit sequentially, no gain is obtained by performing simultaneous load/store operations. The highest speed-ups are obtained by the Insertion 256 + FIFO-based merge sorting unit with size 512 (Insertion 256 + FIFO 512).
In this case, the speed-up increases when going from 1 to 2 simultaneous load/store operations, which is explained by the fact that this particular unit uses 2 input FIFOs.
In the timing expressions used for these estimates, T(n) is the total time to sort n elements, where n is the maximum number of elements that can be sorted directly on the sorting unit, k represents the number of simultaneous load/store operations, t_load the time to load data from memory, t_store the time to store data in memory, and t_sort_unit(n) the time required by the sorting unit to sort n elements, considering that the data are being loaded. T_software_merge(N) is the total time to sort N elements using software merge, and p = log(N). For large sorts, N is typically much greater than n.
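The combination of a fixed-capacity sorting unit with a software merge, as discussed above, can be illustrated with a minimal Python sketch. It is not the implementation used for the measurements: the `unit_sort` callback (here Python's built-in `sorted`) is a hypothetical stand-in for a hardware sorting unit of capacity `unit_size`, the names `merge_two` and `sort_with_unit` are illustrative, and the pairwise merge order is an assumption. The pass count returned is a property of this sketch only and is not claimed to equal the p used in the estimates.

```python
from typing import Callable, Iterable, List, Sequence, Tuple


def merge_two(a: Sequence[int], b: Sequence[int]) -> List[int]:
    """Merge two already-sorted sequences into one sorted list."""
    out: List[int] = []
    i = j = 0
    while i < len(a) and j < len(b):
        if b[j] < a[i]:
            out.append(b[j])
            j += 1
        else:
            out.append(a[i])
            i += 1
    out.extend(a[i:])  # at most one of these two tails is non-empty
    out.extend(b[j:])
    return out


def sort_with_unit(
    data: Sequence[int],
    unit_size: int,
    unit_sort: Callable[[Sequence[int]], Iterable[int]] = sorted,
) -> Tuple[List[int], int]:
    """Sort `data` in blocks of `unit_size`, then merge the blocks in software.

    `unit_sort` stands in for the hardware sorting unit (assumption).
    Returns the sorted list and the number of software merge passes.
    """
    if unit_size < 1:
        raise ValueError("unit_size must be at least 1")

    # Step 1: each block fits in the sorting unit and is sorted by it.
    runs = [
        list(unit_sort(data[i:i + unit_size]))
        for i in range(0, len(data), unit_size)
    ]

    # Step 2: software merge, combining neighbouring runs pass by pass.
    passes = 0
    while len(runs) > 1:
        merged = [
            merge_two(runs[i], runs[i + 1])
            for i in range(0, len(runs) - 1, 2)
        ]
        if len(runs) % 2:  # an odd run is carried over unchanged
            merged.append(runs[-1])
        runs = merged
        passes += 1

    return (runs[0] if runs else []), passes
```

For example, with 16K elements and a 1024-element unit, this sketch produces 16 sorted runs and needs 4 merge passes, each of which touches every element once. This is why the software-merge overhead grows as N becomes much larger than n.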
Conclusions
We describe in this paper three different approaches for hardware sorting units. The sorting units proposed have been coupled to a microprocessor in an FPGA-based embedded system. The sorting units explore different architectures: sorting networks with one or two levels, an insertion sorting array, and a particular sorting unit based on FIFOs. We evaluated these units by coupling them to the peripheral on-chip bus in a sys...