We present a fast algorithm for out-of-place bit-reversed permutation of large vectors for input to an FFT. It is an extension of two previously published methods with special consideration of advanced CPU hardware features. In particular, the method makes heavy use of cache prefetching, MMX and SSE units, and write-combining buffers. Implementations have been made in assembly language for 2-byte and 4-byte operands. In terms of efficiency the method significantly outperforms previously reported methods.
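For orientation, the permutation itself can be written as a plain scalar reference routine. The following is only a minimal sketch in C, not the optimized method presented here: it uses none of the prefetching, MMX/SSE, or write-combining techniques, and the names `reverse_bits` and `brp_out_of_place` are hypothetical. It assumes 2-byte operands and a vector length of 2^log2n.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Reverse the lowest `bits` bits of `x` (hypothetical helper). */
uint32_t reverse_bits(uint32_t x, unsigned bits)
{
    uint32_t r = 0;
    for (unsigned i = 0; i < bits; ++i) {
        r = (r << 1) | (x & 1u);  /* append the current lowest bit of x to r */
        x >>= 1;
    }
    return r;
}

/* Out-of-place BRP: dst[reverse(i)] = src[i] for 2^log2n elements.
 * src and dst must not overlap. */
void brp_out_of_place(const int16_t *src, int16_t *dst, unsigned log2n)
{
    assert(log2n < 32);  /* indices must fit in 32 bits */
    const uint32_t n = (uint32_t)1 << log2n;
    for (uint32_t i = 0; i < n; ++i)
        dst[reverse_bits(i, log2n)] = src[i];
}
```

For eight elements (log2n = 3), the input 0, 1, 2, 3, 4, 5, 6, 7 is written out as 0, 4, 2, 6, 1, 5, 3, 7. The reads are sequential but the writes are scattered across the whole destination vector, which is the irregular access pattern that makes a naive implementation slow on large vectors.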
INTRODUCTION
The Fast Fourier Transform (FFT) is a frequently used operation in many engineering applications. Often, overall application performance is determined by FFT runtime, so an efficient FFT implementation can be of great importance. One problem is that for a radix-2 FFT, input data is not read in natural order, but in the order of the bit-reversed indices. This can cause severe performance penalties due to the irregular memory access pattern. The problem becomes worse if special-purpose processors such as Graphics Processing Units (GPUs) are used for FFT processing. These chips are optimized for linear access to video memory; any deviation from this access pattern can cause the memory bandwidth to drop to a few percent of peak performance.
CPUs, on the other hand, are much better optimized for irregular memory accesses by means of large on-chip caches. Also, typical memory systems in modern PCs are designed for shorter access latencies.
Moreover, a common scenario includes a data acquisition device providing a constant stream of data. In that case, offloading the bit-reversed permutation (henceforth BRP) from the GPU can increase system efficiency by establishing concurrent CPU-GPU operation. Lastly, if the data stream needs to be reformatted anyway (stripping headers etc.), it might be possible to merge the bit reversal with the required reformatting. Thus, using the CPU for BRP has potential benefits.
However, such operation needs to keep pace with the algorithms running on the GPU. FFTs even on large vectors (>8M elements) take only a few milliseconds. All previously published methods for CPU-based BRP appeared to be too slow. This was the motivation for this work.
Our department's main mission is processing signals from the Effelsberg radio telescope in Germany. The current project deals with high-precision pulsar timing.
Here, the signal chain includes analog amplifiers and filters, a high-speed data acquisition device, and one or more PCs equipped with powerful GPUs for signal processing.
Inside the PC, data travels from the network adapter to main memory, possibly multiple times between the CPU and main memory, and finally via DMA to video memory. The task is to minimize traffic between CPU and main memory.
The hardware features we can use are the caches and the write buffers of the CPU, as will be detailed later. Bit-reversed permutation in the inner loop is done using the interleave operations provided by the MMX and SSE units, as was proposed in [6]. Loop organization is done in the spiri...