Our previous studies reported that chimpanzees share with humans an ability to produce spontaneous temporal coordination (Yu & Tomonaga, 2015; 2016). However, it remains unclear how visual cues of an interacting partner's movement influence the emergence of tempo convergence. The current study compared humans and chimpanzees under the same experimental setup as that used in Yu & Tomonaga (2016). Three conditions were prepared: baseline, paired-invisible, and paired-visible. In the baseline condition, each participant produced a repetitive tapping movement alone. In the two paired conditions, the participants in a pair produced the tapping movement concurrently while facing a conspecific partner. In the paired-invisible condition, a visual barrier was placed between the participants to block visual cues of the partner's movement. Moderate auditory cues, corresponding to each participant's tapping movement, were presented in all conditions. The results showed significant changes in tapping tempo between the baseline and paired-invisible conditions, whereas there was little change between the paired-invisible and paired-visible conditions, in both species. This stepwise analysis across the three conditions demonstrates that auditory cues were more influential than the additional visual cues of an interacting partner's movement on tempo convergence in humans and chimpanzees.