In this article we discuss our implementation of a polyphase filter for
real-time data processing in radio astronomy. We describe in detail our
implementation of the polyphase filter algorithm and its behaviour on three
generations of NVIDIA GPU cards, on dual Intel Xeon CPUs and on the Intel Xeon Phi
(Knights Corner) platform. All of our implementations aim to exploit the
potential for data reuse that the algorithm offers. Our GPU implementations
explore two different methods for achieving this: the first makes use of the
L1/Texture cache, and the second uses shared memory. We discuss the usability of
each of our implementations along with their behaviours. We measure performance
in execution time, which is a critical factor for real-time systems; we also
present results in terms of bandwidth (GB/s), compute (GFlop/s) and type
conversions (GTc/s). In addition, we present our results in terms of the
sample rate that a chosen platform can process in real time, which
more intuitively describes the expected performance in a signal processing
setting. Our findings show that, for the GPUs considered, the performance of
our polyphase filter when using lower precision input data is limited by type
conversions rather than device bandwidth. We compare these results to an
implementation on the Xeon Phi. We show that our Xeon Phi implementation achieves
a performance that is 1.47x to 1.95x greater than our CPU implementation; however,
this is insufficient to compete with the performance of GPUs. We conclude with a
comparison of our best performing code to two other implementations of the
polyphase filter, showing that our implementation is faster in nearly all
cases. This work forms part of the Astro-Accelerate project, a many-core
accelerated real-time data processing library for digital signal processing of
time-domain radio astronomy data.

Comment: 19 pages, 20 figures, 5 tables
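
For readers unfamiliar with the algorithm, the following is a minimal reference sketch of a polyphase filter in its common textbook form: a FIR stage that forms each output spectrum as a weighted sum of several consecutive input blocks, followed by a Fourier transform across channels. It is not the GPU, CPU or Xeon Phi code described above. The function names (`ppf_fir`, `dft`, `polyphase_filter`), the coefficient layout and the use of a naive DFT in place of an FFT are all assumptions made for illustration. The sketch does show where the data reuse mentioned above comes from: each input block contributes to `n_taps` successive output spectra.

```python
import cmath


def ppf_fir(x, coeffs, n_channels, n_taps):
    """FIR stage of a polyphase filter (illustrative layout, an assumption).

    x      : flat list of complex samples, grouped in blocks of n_channels
    coeffs : flat list of n_taps * n_channels real filter coefficients
    Returns one filtered block (list of n_channels values) per output spectrum.
    """
    n_spectra = len(x) // n_channels - n_taps + 1
    out = []
    for s in range(n_spectra):
        row = []
        for c in range(n_channels):
            acc = 0j
            # Input block s + t is reused by output spectra s, s-1, ..., s-n_taps+1.
            for t in range(n_taps):
                acc += coeffs[t * n_channels + c] * x[(s + t) * n_channels + c]
            row.append(acc)
        out.append(row)
    return out


def dft(row):
    """Naive O(n^2) discrete Fourier transform, standing in for an FFT."""
    n = len(row)
    return [
        sum(row[k] * cmath.exp(-2j * cmath.pi * k * m / n) for k in range(n))
        for m in range(n)
    ]


def polyphase_filter(x, coeffs, n_channels, n_taps):
    """Apply the FIR stage, then transform each filtered block across channels."""
    return [dft(row) for row in ppf_fir(x, coeffs, n_channels, n_taps)]
```

With `n_channels = 4` and `n_taps = 2`, an input of 12 samples holds three blocks and therefore yields two output spectra; the middle block is read by both, which is the reuse that a cache-based or shared-memory implementation would try to exploit.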