Human-robot interactions are expected to increase as robots become more pervasive. One important aspect is gestural communication, which is widely used in rehabilitation and therapeutic robotics. Synchrony is a key component of interpersonal interactions and affects the interaction on both the behavioural and the social level. When interacting physically with a robot, one perceives the robot's movements, but the robot's actuators also produce sound. In this work, we investigate whether the sound of actuators can hamper human coordination in rhythmic human-robot interactions. The human brain gives auditory input priority over visual input, an effect that can sometimes be strong enough to alter or even suppress visual perception. Under certain circumstances, however, the auditory signal and the visual perception can reinforce each other. We present a study in which participants were asked to return a waving-like gesture to a robot under three conditions: visual perception only, auditory perception only, and both combined. We analyze coordination performance and gaze focus in each condition. Results show that the combination of visual and auditory perception perturbs the rhythmic interaction.