Macro-particle tracking is a prominent method for studying collective beam instabilities in accelerators. However, the heavy computational load often limits the capability of tracking codes. One widely used macro-particle tracking code for simulating collective instabilities in storage rings is mbtrack. The original mbtrack already uses the Message Passing Interface (MPI) to accelerate simulations. However, the analysis of coupled-bunch instabilities in mbtrack requires many CPU threads, so computer clusters or desktops with many CPU cores are needed. Since these are not always available, we employ as an alternative a Graphics Processing Unit (GPU) with the CUDA programming interface to run such simulations on a stand-alone workstation. All the heavy computations have been moved to the GPU. The benchmarks confirm that mbtrack-cuda can be used to analyze coupled-bunch instabilities for up to at least 484 bunches. Compared to mbtrack on an 8-core CPU, a 36-core CPU and a cluster, mbtrack-cuda is faster for simulations of up to 3 bunches. For 363 bunches, mbtrack-cuda needs about six times the execution time of the cluster and twice that of the 36-core CPU. The multi-bunch instability analysis shows that the length of the ion-cleaning gap has little influence, at least when the ring is filled to 3/4.

1 Haisheng.Xu@ihep.ac.cn, the author is presently in the Institute