To meet the high-throughput requirements of communication systems, the design of high-throughput low-density parity-check (LDPC) decoders has attracted significant attention. This paper proposes a high-throughput GPU-based LDPC decoder for large-scale data processing scenarios, optimized in terms of both decoding parallelism and data scheduling strategy. For decoding parallelism, intra-codeword parallelism is fully exploited by combining the characteristics of the flooding-based decoding algorithm with the GPU programming model, and inter-codeword parallelism is improved using single-instruction multiple-data (SIMD) instructions. For the data scheduling strategy, the utilization of off-chip memory is optimized to satisfy the demands of large-scale data processing. Experimental results demonstrate that the decoder achieves a throughput of 10 Gbps on general-purpose GPU (GPGPU) devices when the early termination mechanism is incorporated, and that it also achieves high throughput and high power efficiency on low-power embedded GPU (EGPU) devices. Compared with the state-of-the-art work, the proposed decoder achieves a 1.787× normalized throughput speedup at the same error-correcting performance.