An error floor phenomenon, decoding performance, and throughput are three major concerns for LDPC decoders in NAND Flash applications. With a penalty method and an active iteration mechanism, we present a Unified Penalty Gradient Descent Bit Flipping (UP-GDBF) decoding algorithm, which not only possesses error-floor free property but also improves convergence speed in decoding performance. To fulfill the high-throughput requirement while maintaining reliable error correction capability, we propose an energy-based backtracking scheme to reduce 40% latency with a negligible 0.8% area overhead. Implemented in TSMC 16nm process, the proposed 4KB LDPC decoder can achieve a throughput of 19.3 Gbps with 0.120 mm 2 area to satisfy ONFI 5.0 throughput requirement. Compared to existing approaches, our decoder architecture provides superior data rate and decoding performance in both 1KB and 4KB LDPC codes.