Population annealing is a universal algorithm applicable to statistical mechanics systems and optimization problems. It is potentially scalable on any parallel architecture. We review recent developments in the area, emphasizing the implementation of the algorithm on a hybrid parallel program architecture combining CUDA and MPI. The main challenge is to keep all general-purpose graphics processing unit devices as busy as possible by efficiently redistributing replicas. We provide testing details on hardware based on Intel Skylake/Nvidia V100, running more than two million replicas of the Ising model in parallel. As the complexity of the simulated system increases, the acceleration grows toward perfect scalability.

This work was done under Grant No. 19-11-00286 from the Russian Science Foundation and was supported in part through computational resources of HPC facilities at HSE University.