Audio imaging can play a fundamental role in computer vision, particularly in automated surveillance, by boosting the accuracy of current systems based on standard optical cameras. We present a method for object tracking that fuses a visual image with an audio image within the template-matching framework. First, we present an improved template-matching tracker that handles the chaotic movements arising in the template-matching algorithm. We then present a fusion scheme that exploits deviations in the pattern of correlation scores obtained across individual frames in each imaging domain. The method is compared with various state-of-the-art trackers that estimate tracks using only visible imagery. Results show a significant improvement in object tracking with the assistance of audio imaging under severely challenging visual conditions such as occlusion, object shape deformation, clutter, and camouflage.
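To make the idea of score-driven fusion concrete, the following is a minimal sketch and not the method proposed here. It assumes that the visual and audio images are registered to the same grid, that one template is available per domain, and that each domain is weighted by how far its best correlation score stands above the mean of its score map. The function names `ncc_map`, `peak_deviation`, and `fuse`, and the weighting rule itself, are hypothetical choices made only for illustration.

```python
import math

def ncc_map(image, template):
    """Normalized cross-correlation score at every valid template position."""
    ih, iw = len(image), len(image[0])
    th, tw = len(template), len(template[0])
    t_vals = [v for row in template for v in row]
    t_mean = sum(t_vals) / len(t_vals)
    t_dev = [v - t_mean for v in t_vals]
    t_norm = math.sqrt(sum(d * d for d in t_dev))
    scores = []
    for y in range(ih - th + 1):
        row_scores = []
        for x in range(iw - tw + 1):
            p_vals = [image[y + dy][x + dx]
                      for dy in range(th) for dx in range(tw)]
            p_mean = sum(p_vals) / len(p_vals)
            p_dev = [v - p_mean for v in p_vals]
            p_norm = math.sqrt(sum(d * d for d in p_dev))
            denom = t_norm * p_norm
            num = sum(a * b for a, b in zip(t_dev, p_dev))
            # A flat patch carries no evidence, so it scores zero.
            row_scores.append(num / denom if denom > 0 else 0.0)
        scores.append(row_scores)
    return scores

def peak_deviation(scores):
    """Distance of the best score from the map mean, in standard deviations."""
    flat = [s for row in scores for s in row]
    mean = sum(flat) / len(flat)
    std = math.sqrt(sum((s - mean) ** 2 for s in flat) / len(flat))
    return (max(flat) - mean) / std if std > 0 else 0.0

def fuse(visual_scores, audio_scores):
    """Blend two equally sized score maps (assumed weighting rule)."""
    wv = peak_deviation(visual_scores)
    wa = peak_deviation(audio_scores)
    total = wv + wa
    if total == 0:
        wv = wa = 0.5  # neither domain is informative; fall back to equal weights
    else:
        wv, wa = wv / total, wa / total
    fused = [[wv * v + wa * a for v, a in zip(v_row, a_row)]
             for v_row, a_row in zip(visual_scores, audio_scores)]
    positions = [(y, x) for y in range(len(fused)) for x in range(len(fused[0]))]
    best = max(positions, key=lambda p: fused[p[0]][p[1]])
    return fused, best, (wv, wa)

# Toy frame: the visual image is featureless (as under camouflage),
# while the audio image contains the template pattern at row 1, column 2.
template = [[1, 2], [3, 4]]
visual = [[5, 5, 5, 5] for _ in range(4)]
audio = [[0, 0, 0, 0],
         [0, 0, 1, 2],
         [0, 0, 3, 4],
         [0, 0, 0, 0]]

fused, best, weights = fuse(ncc_map(visual, template), ncc_map(audio, template))
```

In this toy frame the visual score map is flat, so all of the weight shifts to the audio domain and the fused estimate follows the audio peak. A practical tracker would likely also smooth the weights over time and restrict the search to a window around the previous estimate.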