We describe the implementation and performance of the P³T (Particle-Particle Particle-Tree) scheme for simulating dense stellar systems. In P³T, the force on a particle is split into short-range and long-range contributions. Short-range forces are evaluated by direct summation and integrated with the fourth-order Hermite predictor-corrector method with block timesteps. Long-range forces are evaluated with the Barnes-Hut tree code and integrated with the leapfrog integrator. The tree part of our simulation environment is accelerated using graphics processing units (GPUs), whereas the direct summation is carried out on the host CPU. Our code gives excellent performance and accuracy for star cluster simulations with a large number of particles, even when the core of the star cluster is small.
Background

Direct N-body simulation has been the most useful tool for studying the evolution of collisional stellar systems such as star clusters and the center of the galaxy (Aarseth). The force calculation, whose cost is O(N²), is the most compute-intensive part of a direct N-body simulation. Barnes and Hut developed a scheme that reduces the calculation cost to O(N log N) by constructing a tree structure and evaluating multipole expansions. Dehnen developed a scheme that reduces the calculation cost to O(N) by combining the fast multipole method (Greengard and Rokhlin) with the tree code. Recently, graphics processing units (GPUs), devices originally developed for rendering graphical images, have started to be used for scientific simulations. The tree code has also been implemented on GPUs, where it runs much faster than on CPUs (Gaburov et al.).

Tree schemes are widely used for simulations of collisionless systems. For collisional systems, however, the use of the tree code has been very limited. One reason might be that a collisional stellar system spans a wide range of timescales, so it is essential that each particle has its own integration timestep. This scheme is called the individual timestep or the block timestep (McMillan). However, when the tree code and the block timestep are used together, the tree must be reconstructed at every block timestep, because the positions of the integrated particles are updated. The cost of the usual complete reconstruction of the tree is O(N log N), which is not negligible.

To reduce the cost of reconstructing the tree, McMillan and Aarseth introduced local reconstruction of the tree. They demonstrated good performance, but there seems to be no obvious way to parallelize their scheme.
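
To make the block timestep idea above concrete, the following is a minimal sketch in Python, not the implementation used in this work. It assumes the common convention that each particle's step is the largest power-of-two fraction of a maximum step `dt_max` that does not exceed the step the particle would naturally take. The names `block_timestep`, `active_particles`, `dt_natural`, and `dt_max` are hypothetical and chosen only for illustration.

```python
import math


def block_timestep(dt_natural, dt_max=1.0):
    """Return the largest dt_max / 2**k (k >= 0) not exceeding dt_natural.

    Assumption: steps are quantized to powers of two below dt_max, so that
    particles sharing a step size are updated at the same times.
    """
    if dt_natural <= 0.0:
        raise ValueError("dt_natural must be positive")
    k = max(0, math.ceil(math.log2(dt_max / dt_natural)))
    return dt_max / 2**k


def active_particles(t_next):
    """Return the earliest update time and the particles due at that time.

    Exact float comparison is safe here because every time is a sum of
    power-of-two fractions of dt_max, which are exactly representable.
    """
    t_min = min(t_next)
    return t_min, [i for i, t in enumerate(t_next) if t == t_min]


# Three particles with different (hypothetical) natural timesteps.
steps = [block_timestep(dt) for dt in (0.3, 0.01, 0.07)]
t_now, block = active_particles(steps)
```

In this sketch only the particle with the smallest step is in the first block, yet its new position would already invalidate a tree built at the start of the step, which is the reconstruction cost discussed above. A production scheme also has to restrict when a step may grow (a step can be doubled only at a time commensurate with the new step); that is not handled here.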