Recently, a new trend has emerged in the field of parallel and high performance computing, the hybrid implementation using CPU-GPU modules. In such implementations, the computational load is shared between the CPU and GPU, in order to improve the computational efficiency. However, the task of sharing the computational load between the two modules is a rather difficult one, with a number of limitations being imposed. This paper extends our recent work on community detection, which is based on transforming a network of nodes into a set of threaded binary trees. In this work, we share the computational load between the two units: the CPU takes specific samples of the network communities and organizes them in the form of threaded binary trees. The GPU takes over the heavy load of reading this data and transforming it into a path-matrix. Finally, this matrix is sent back to the CPU for analysis, community detection and overlaps, as well as network information upgrades. Our simulation results show significant improvement over our previous strategy and other known community detection strategies found in the literature.