Parallel LDPC decoding using CUDA and OpenMP

Park, Joo-Yul; Chung, Ki-Seok

doi:10.1186/1687-1499-2011-172

Cited by 12 publications

(12 citation statements)

References 13 publications

“…Based on the delivered messages, each node attempts to decode its own value. If the decoded value turns out to contain error, the decoding process is repeated for a predefined number of times [32]. Typically, there are two ways to deliver messages in LDPC decoding.…”

Section: Review of LDPC Decoding Algorithm (mentioning)

confidence: 99%

Graphics processing unit-accelerated joint-bitplane belief propagation algorithm in DSC

et al. 2016

The M-ary source with nonstationary correlation can be encoded with a single binary low-density parity-check (LDPC) code and decoded together in distributed source coding. The joint-bitplane belief propagation (JBBP) is a useful decoding algorithm for multiple bitplanes of an M-ary source. However, it suffers from the drawbacks of low computational efficiency and long execution time. Motivated by the evolution of the Graphics Processing Unit (GPU) and the inherent parallel characteristic of the JBBP, we propose a novel approach for the computationally intensive processing of the JBBP algorithm on GPU using the compute unified device architecture programming model. Two different parallel modes are utilized for the belief passing between different nodes of the JBBP. It is found that the bottlenecks of the JBBP lie in computing the overall probability mass functions (pmfs) of symbol nodes and the overall beliefs of bit nodes. Thus, a data partitioning method is leveraged to split a large array of pmfs into small pieces which can be loaded into L1 cache instead of global memory. The optimal block size is selected which not only assigns as large L1 cache as possible for individual thread, but also guarantees multiple active warps in each stream multiprocessor. Experimental results show that when the length-6336 (length-50,688, resp.) LDPC accumulate (LDPCA) code is used to compress the source, the JBBP decoder can achieve about 20× (41×, resp.) speedup on GPU compared with the original C code on CPU. Better performance would be further obtained with longer LDPCA codes. Moreover, the parallel JBBP is also applied in hyperspectral image compression and video coding and it shows good speedup performance. (Author: Yong Fang)

“…The study of LDPC codes was resurrected with the work of Mackay, Neal [27,28] and Luby [1]. Moreover, LDPC codes are superior to Turbo codes which were originally regarded as the best channel coding technique before the LDPC began to draw attention [32]. Therefore, a good candidate for realization of DSC is LDPC code.…”

Section: Introduction (mentioning)

confidence: 99%

Graphics processing unit-accelerated joint-bitplane belief propagation algorithm in DSC

et al. 2016

“…Iterative decoding based on the delivered messages is processed during the CNP and BNP steps. The decoding process is iterated until the termination condition for the decoded words is satisfied at the PC step [10].…”

Section: LDPC Decoding Algorithm (mentioning)

confidence: 99%
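
The CNP, BNP, and PC steps named in the statement above can be sketched as a small min-sum decoder. This is a minimal illustration under stated assumptions, not the decoder from either paper: the function names `parity_ok` and `min_sum_decode`, the `max_iters` limit, and the tiny 3x7 parity-check matrix (a Hamming(7,4) matrix standing in for a real, much larger and sparser H-matrix) are all hypothetical. The two inner loops run independently per check node and per bit node, which is the kind of work a parallel decoder would distribute across threads.

```python
# Hypothetical stand-in for an LDPC H-matrix: the 3x7 Hamming(7,4)
# parity-check matrix. Real LDPC matrices are far larger and sparser.
H = [
    [1, 0, 1, 0, 1, 0, 1],
    [0, 1, 1, 0, 0, 1, 1],
    [0, 0, 0, 1, 1, 1, 1],
]


def parity_ok(H, bits):
    """PC step: every row of H must have even parity over the decoded bits."""
    return all(sum(h * b for h, b in zip(row, bits)) % 2 == 0 for row in H)


def min_sum_decode(H, llr, max_iters=20):
    """Iterate CNP and BNP until the parity check passes or max_iters is hit.

    llr[j] is the channel log-likelihood ratio of bit j (positive means 0).
    Returns (decoded_bits, success_flag).
    """
    m, n = len(H), len(H[0])
    edges = [(i, j) for i in range(m) for j in range(n) if H[i][j]]
    b2c = {(i, j): llr[j] for (i, j) in edges}  # bit-to-check messages
    c2b = {e: 0.0 for e in edges}               # check-to-bit messages
    bits = [0 if v >= 0 else 1 for v in llr]    # initial hard decision

    for _ in range(max_iters):
        # PC: stop as soon as the current hard decision is a codeword.
        if parity_ok(H, bits):
            return bits, True

        # CNP: each check node combines the messages from its other bits
        # (product of signs, minimum of magnitudes).
        for i in range(m):
            cols = [j for j in range(n) if H[i][j]]
            for j in cols:
                others = [b2c[(i, k)] for k in cols if k != j]
                sign = -1.0 if sum(v < 0 for v in others) % 2 else 1.0
                c2b[(i, j)] = sign * min(abs(v) for v in others)

        # BNP: each bit node sums its channel LLR and incoming messages,
        # then sends each check the total minus that check's own message.
        for j in range(n):
            rows = [i for i in range(m) if H[i][j]]
            total = llr[j] + sum(c2b[(i, j)] for i in rows)
            for i in rows:
                b2c[(i, j)] = total - c2b[(i, j)]
            bits[j] = 0 if total >= 0 else 1

    return bits, parity_ok(H, bits)


# All-zero codeword with one unreliable bit (index 2) received as a 1.
decoded, ok = min_sum_decode(H, [2.0, 2.0, -1.0, 2.0, 2.0, 2.0, 2.0])
```

In this example the single erroneous bit is corrected after one CNP/BNP pass, and the PC step ends the loop on the next iteration.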

“…As the size of an H-matrix increases, the amount of computation grows rapidly. Therefore, designing parallel LDPC decoders using multi-core processors has been actively studied to provide reliable high-speed data transmission [6]- [10]. However, even if most approaches could reduce the decoding time significantly, hardware resource utilization would be insufficient because existing models were parallelized for specific devices with hardware-dependent programming models.…”

Section: Introduction (mentioning)

confidence: 99%

Parallel LDPC Decoding on a Heterogeneous Platform using OpenCL

2016

KSII TIIS

Modern mobile devices are equipped with various accelerated processing units to handle computationally intensive applications; therefore, Open Computing Language (OpenCL) has been proposed to fully take advantage of the computational power in heterogeneous systems. This article introduces a parallel software decoder of Low Density Parity Check (LDPC) codes on an embedded heterogeneous platform using an OpenCL framework. The LDPC code is one of the most popular and strongest error correcting codes for mobile communication systems. Each step of LDPC decoding has different parallelization characteristics. In the proposed LDPC decoder, steps suitable for task-level parallelization are executed on the multi-core central processing unit (CPU), and steps suitable for data-level parallelization are processed by the graphics processing unit (GPU). To improve the performance of OpenCL kernels for LDPC decoding operations, explicit thread scheduling, vectorization, and effective data transfer techniques are applied. The proposed LDPC decoder achieves high performance and high power efficiency by using heterogeneous multi-core processors on a unified computing framework.

“…16), Wi-Fi (IEEE802.11), digital high definition TV, wideband code division multiple access (W-CDMA), and global system for mobile communication (GSM) [1]- [7]. However, the fixed functionality of such ASIC devices limits their application to emerging communication standards because they were fixed for specific coding schemes, data rates, frequency ranges, and types of modulation [8]. In addition, manufacturing costs are high and time-to-market of hardware devices is long [9].…”

Section: Introduction (mentioning)

confidence: 99%

Computationally efficient implementation of a Hamming code decoder using graphics processing unit

Islam, Kim

2015

J. Commun. Netw.

This paper presents a computationally efficient implementation of a Hamming code decoder on a graphics processing unit (GPU) to support real-time software-defined radio (SDR), which is a software alternative for realizing wireless communication. The Hamming code algorithm is challenging to parallelize effectively on a GPU because it works on sparsely located data items with several conditional statements, leading to non-coalesced, long latency, global memory access, and huge thread divergence. To address these issues, we propose an optimized implementation of the Hamming code on the GPU to exploit the higher parallelism inherent in the algorithm. Experimental results using a compute unified device architecture (CUDA)-enabled NVIDIA GeForce GTX 560, including 335 cores, revealed that the proposed approach achieved a 99x speedup versus the equivalent CPU-based implementation.
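
For orientation, syndrome decoding of a Hamming code can be sketched as follows. This is a minimal CPU-side illustration, assuming a (7,4) code with bit positions numbered 1 to 7 and parity bits at positions 1, 2, and 4; the abstract does not state which Hamming code the paper uses, and the function names here are hypothetical. Each codeword is decoded independently of the others, which is the data-level parallelism a GPU implementation would map onto threads, while the `if s:` branch is a small example of the conditional logic the abstract associates with thread divergence.

```python
def hamming74_syndrome(word):
    """XOR together the 1-based positions of all set bits.

    The result is 0 for a valid codeword, otherwise it is the position
    of the single flipped bit.
    """
    s = 0
    for pos, bit in enumerate(word, start=1):
        if bit:
            s ^= pos
    return s


def hamming74_correct(word):
    """Return (corrected_word, syndrome) for one 7-bit received word."""
    s = hamming74_syndrome(word)
    fixed = list(word)
    if s:  # non-zero syndrome: flip the bit it points at
        fixed[s - 1] ^= 1
    return fixed, s


def decode_batch(words):
    """Decode many words; each word is independent of the others."""
    return [hamming74_correct(w)[0] for w in words]


# Codeword with bits set at positions 3, 5, 6 (3 ^ 5 ^ 6 == 0),
# received with position 7 flipped.
corrected, syndrome = hamming74_correct([0, 0, 1, 0, 1, 1, 1])
```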
