The entertainment industry, primarily the video games industry, continues to dictate the development and performance requirements of graphics hardware and computer graphics algorithms. However, despite the enormous progress in the last few years it is still not possible to achieve some of industry's demands, in particular high-fidelity rendering of complex scenes in real-time, on a single desktop machine. A realisation that sound/music and other senses are important to entertainment, led to an investigation of alternative methods, such as cross-modal interaction in order to try and achieve the goal of "realism in real-time". In this paper we investigate the cross-modal interaction between vision and audition for reducing the amount of computation required to compute visuals by introducing movement related sound effects. Additionally, we look at the effect of camera movement speed on temporal visual perception. Our results indicate that slow animations are perceived as smoother than fast animations. Furthermore, introducing the sound effect of footsteps to walking animations further increased the animation smoothness perception. This has the consequence that for certain conditions the number of frames that need to be rendered each second can be reduced, saving valuable computation time, without the viewer being aware of this reduction. The results presented are another step towards the full understanding of the auditory-visual cross-modal interaction and its importance for helping achieve "realism int real-time".