“…On the GPU, we implement the computing flow as CUDA kernel functions and invoke each kernel with a large number of parallel threads, performing the computation in parallel under the single-instruction, multiple-thread (SIMT) execution model. In our previous work [19], we designed three kernels to perform the polynomial computation, the filtering computation, and the accumulation, invoked with N, NR, and N threads, respectively, which indicates their degrees of parallelism. However, this design requires sharing intermediate results between the kernels through GPU global memory, which incurs significant memory-access overhead.…”
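To make the described structure concrete, the following is a minimal sketch of such a three-kernel pipeline. The kernel names, the placeholder arithmetic inside each kernel, and the concrete values of N and R are all hypothetical illustrations, not the paper's actual computation; the point is only the launch configuration (N, N·R, and N threads) and the fact that the intermediate buffers `poly` and `filt` must round-trip through global memory between launches, which is the overhead the text identifies.

```cuda
#include <cuda_runtime.h>

// Hypothetical problem sizes; the paper's actual N and R are not given here.
#define N 1024   // number of coefficients (assumed)
#define R 4      // expansion factor of the filtering stage (assumed)

// Kernel 1: polynomial computation, one thread per coefficient (N threads).
// Writes its results to global memory so the next kernel can read them.
__global__ void polynomialKernel(const float* in, float* poly) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N) {
        float x = in[i];
        poly[i] = x * x + x;           // placeholder polynomial evaluation
    }
}

// Kernel 2: filtering, one thread per (coefficient, tap) pair (N*R threads).
// Reads the intermediate results back from global memory.
__global__ void filterKernel(const float* poly, float* filt) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N * R) {
        int c = i / R;                 // coefficient this thread filters
        filt[i] = poly[c] * 0.5f;      // placeholder filtering step
    }
}

// Kernel 3: accumulation, one thread per coefficient (N threads).
// Again round-trips through global memory.
__global__ void accumulateKernel(const float* filt, float* out) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N) {
        float acc = 0.0f;
        for (int r = 0; r < R; ++r)    // sum this coefficient's R filtered values
            acc += filt[i * R + r];
        out[i] = acc;
    }
}

int main() {
    float *in, *poly, *filt, *out;
    cudaMalloc(&in,   N * sizeof(float));
    cudaMalloc(&poly, N * sizeof(float));       // intermediate buffer 1
    cudaMalloc(&filt, N * R * sizeof(float));   // intermediate buffer 2
    cudaMalloc(&out,  N * sizeof(float));

    const int T = 256;
    // Three separate launches with N, N*R, and N threads; poly and filt
    // live in global memory between launches, incurring the extra
    // memory traffic the text describes.
    polynomialKernel<<<(N + T - 1) / T, T>>>(in, poly);
    filterKernel<<<(N * R + T - 1) / T, T>>>(poly, filt);
    accumulateKernel<<<(N + T - 1) / T, T>>>(filt, out);
    cudaDeviceSynchronize();

    cudaFree(in); cudaFree(poly); cudaFree(filt); cudaFree(out);
    return 0;
}
```

Because each launch forms a separate grid, global memory is the only channel through which the kernels can exchange data; fusing the stages into one kernel (so intermediates stay in registers or shared memory) is the natural way to remove this round-trip, which appears to be the direction the surrounding text motivates.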