The particle counting part in the unified gas-kinetic wave–particle method on graphics processing unit (GPU) devices is computationally intensive. This paper introduces a piecewise-hierarchical (P-H) particle counting strategy tailored for the Single Instruction Multiple Threads architecture, which leverages GPU memory hierarchy to reduce access conflicts. The strategy was evaluated based on throughput, roofline performance, and computation time metrics. Compared to the global counting strategy, the P-H approach achieved a 3.37× speedup for the particle counting kernel, and the overall program experienced a performance boost of more than 30%.