An M-ary source with nonstationary correlation can be encoded with a single binary low-density parity-check (LDPC) code and decoded jointly in distributed source coding. Joint-bitplane belief propagation (JBBP) is an effective algorithm for decoding multiple bitplanes of an M-ary source, but it suffers from low computational efficiency and long execution time. Motivated by the evolution of the graphics processing unit (GPU) and the inherently parallel nature of JBBP, we propose a novel approach that runs the computationally intensive parts of the JBBP algorithm on a GPU using the Compute Unified Device Architecture (CUDA) programming model. Two different parallel modes are used for belief passing between the different nodes of the JBBP. We find that the bottlenecks of the JBBP lie in computing the overall probability mass functions (pmfs) of symbol nodes and the overall beliefs of bit nodes. A data partitioning method is therefore used to split a large array of pmfs into small pieces that can be loaded into L1 cache instead of global memory. The block size is chosen so that each thread is assigned as much L1 cache as possible while multiple warps remain active on each streaming multiprocessor. Experimental results show that when a length-6336 (respectively, length-50,688) LDPC accumulate (LDPCA) code is used to compress the source, the JBBP decoder achieves a speedup of about 20× (respectively, 41×) on the GPU compared with the original C code on the CPU. Larger speedups are expected with longer LDPCA codes. Moreover, the parallel JBBP is also applied to hyperspectral image compression and video coding, where it shows good speedup performance.
Yong Fang
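The trade-off between per-thread cache and active warps described in the abstract can be illustrated with a small host-side calculation. The sketch below is a simplified model and not the selection procedure used in the paper: the function names, the halving strategy, the 4-byte value size, and the default of two active warps are all assumptions made for illustration. Only the warp size of 32 and the limit of 1024 threads per block are general CUDA facts.

```python
WARP_SIZE = 32                 # threads per CUDA warp
MAX_THREADS_PER_BLOCK = 1024   # CUDA limit on threads per block


def choose_block_size(cache_bytes, bytes_per_thread, min_warps=2):
    """Largest warp-multiple block whose per-thread pmf pieces fit in cache.

    Hypothetical helper: returns None when fewer than `min_warps` warps fit,
    which signals that the pmf pieces are still too large.
    """
    fit = cache_bytes // bytes_per_thread
    threads = min(fit, MAX_THREADS_PER_BLOCK) // WARP_SIZE * WARP_SIZE
    if threads < min_warps * WARP_SIZE:
        return None
    return threads


def partition(pmf_len, cache_bytes, min_warps=2, bytes_per_value=4):
    """Halve the pmf piece length until enough warps fit in the cache.

    Assumed strategy: start from the whole pmf (most cache per thread) and
    shrink only as far as the active-warp requirement demands.
    Returns (piece_length, block_size).
    """
    piece = pmf_len
    while piece >= 1:
        block = choose_block_size(cache_bytes, piece * bytes_per_value, min_warps)
        if block is not None:
            return piece, block
        piece //= 2
    raise ValueError("cache too small for the requested number of warps")


if __name__ == "__main__":
    # Assumed 48 KiB of on-chip cache and a 256-entry pmf of 4-byte values.
    print(partition(256, 48 * 1024))
```

Under these assumed numbers, a 256-entry pmf leaves room for only one warp, so it is halved once to 128 entries, after which three warps (96 threads) fit in the cache. On a real device, the cache capacity and occupancy limits would have to be queried rather than hard-coded.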