Research in multimodal interfaces aims to provide immersive solutions and to increase overall human performance. A promising direction is to combine auditory, visual and haptic interaction between the user and the simulated environment. However, no extensive comparison exists to show how combining audiovisuohaptic interfaces would affect human perception and by extent reflected on task performance. Our paper explores this idea and presents a thorough, full-factorial comparison of how all combinations of audio, visual and haptic interfaces affect performance during manipulation. We evaluated how each combination affects the performance in a study (N = 25) consisting of manipulation tasks with various difficulties. The overall performance was assessed using both subjective measures, by assessing cognitive workload and system usability, and objective measurements, by incorporating time and spatial accuracy-based metrics. The results showed that regardless of task complexity, the combination of stereoscopic-vision with the virtual reality headset increased performance across all measurements by 40%, compared to monocular-vision from a generic display monitor. Besides, using haptic feedback improved outcomes by 10% and auditory feedback accounted for approximately 5% improvement.