Since the layered low-density parity-check (LDPC) decoding algorithm was proposed, one can exploit the diversity gain to achieve performance comparable to that of traditional two-phase message passing (TPMP) decoding while converging about twice as fast. To reduce the decoding time of the layered LDPC decoder, a graphics processing unit (GPU) is exploited as the modem processor so that the decoding procedure can be processed in parallel by numerous GPU threads. In this paper, we present parallel algorithms and efficient GPU implementations for two different layered message passing schemes, row-layered and column-layered decoding. In the experiments, the quasi-cyclic LDPC codes for WiFi (802.11n) and WiMAX (802.16e) are decoded by the proposed layered LDPC decoders. The experimental results show that our decoder achieves bit error ratio (BER) performance comparable to that of the TPMP decoder. The peak throughput is 712 Mbps, which is about two orders of magnitude higher than that of a CPU implementation and comparable to that of dedicated hardware solutions. Compared with the fastest existing GPU-based implementation, the presented decoder achieves a 2.3-fold performance improvement.

according to the construction of layers, the row-layered (RL) [4] and the column-layered (CL) one [5,6]. In CL decoding, the variable nodes (VNs) are divided into multiple layers. Similarly, in RL decoding, the check nodes (CNs) are divided into multiple layers. In this paper, both layered decoding algorithms are discussed because of their faster convergence and higher throughput.

To reach the throughput required by the standards, dedicated application-specific integrated circuit (ASIC) solutions for LDPC decoders have been presented in recent years [7-9]. However, ASIC solutions have a long development cycle, high design cost, and fixed functionality.
LDPC decoders on field programmable gate arrays (FPGAs) [10-12] have also been presented in recent years. FPGAs provide the high computational power required in wireless communication, and the throughput of an FPGA-based LDPC decoder can exceed 700 Mbps. However, FPGA developers must learn hardware description languages and become familiar with the associated programming and debugging tools. Although hardware description languages such as Verilog support parameterized designs that adhere to multiple standards, the standard practice is to design at the register transfer level, which cannot be reconfigured in real time for scenarios such as software-defined radio. By contrast, software solutions are less expensive, scalable, and flexible, and they have a shorter development cycle.

Developments in circuit technology and new trends in computer architecture show that the number of cores per chip increases steadily every year at a considerable rate, which has led to new forms of parallelism. As a result, many multicore platforms, such as multicore CPUs from Intel and AMD, the Cell Broadband Engine (Cell/B.E.) from Sony-Toshiba-IBM, and...