The aim of this study was to develop and assess the performance of a video-based augmented reality system that combines preoperative computed tomography (CT) with real-time microscopic video, as a first crucial step toward keyhole middle ear procedures performed through a tympanic membrane puncture. Six different artificial human temporal bones were included in this prospective study. Six stainless steel fiducial markers were glued to the periphery of the eardrum, and a high-resolution CT scan of the temporal bone was obtained. Virtual endoscopy of the middle ear based on this CT scan was performed with OsiriX software. The virtual endoscopy image was registered to the microscope video of the intact tympanic membrane using the fiducial markers, and a homography transformation was applied to maintain the registration during microscope movements. These movements were tracked using the Speeded-Up Robust Features (SURF) method. Simultaneously, a micro-surgical instrument was identified and tracked using a Kalman filter. The 3D position of the instrument was extracted by solving a three-point perspective framework. For evaluation, the instrument was introduced through the tympanic membrane and ink droplets were injected onto three middle ear structures. An average initial registration accuracy of 0.21 ± 0.10 mm (n = 3) was achieved, with slow error propagation during tracking (0.04 ± 0.07 mm). The estimated surgical instrument tip position error was 0.33 ± 0.22 mm. The localization accuracy for the target structures was 0.52 ± 0.15 mm. The submillimetric accuracy of our system, obtained without a tracker, is compatible with ear surgery.
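The fiducial-based registration step can be illustrated with a minimal sketch of homography estimation from matched 2D points. The abstract does not state how the homography was computed, so the approach here (linear least squares with the bottom-right matrix element fixed to 1) is an assumption, as are all function names and the sample coordinates. Fixing that element to 1 presumes it is nonzero, which holds for modest changes in viewpoint.

```python
import math


def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("degenerate point configuration")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def estimate_homography(src, dst):
    """Least-squares 3x3 homography mapping src points to dst points.

    Assumption: the element h[2][2] is fixed to 1 (hypothetical choice,
    not taken from the paper). Needs at least four matched points.
    """
    if len(src) != len(dst) or len(src) < 4:
        raise ValueError("need at least four matched 2D points")
    rows, rhs = [], []
    for (x, y), (u, v) in zip(src, dst):
        # Two linear equations per correspondence.
        rows.append([x, y, 1.0, 0.0, 0.0, 0.0, -u * x, -u * y])
        rhs.append(u)
        rows.append([0.0, 0.0, 0.0, x, y, 1.0, -v * x, -v * y])
        rhs.append(v)
    # Normal equations (A^T A) h = A^T b for the eight unknowns.
    ata = [[sum(r[i] * r[j] for r in rows) for j in range(8)] for i in range(8)]
    atb = [sum(r[i] * t for r, t in zip(rows, rhs)) for i in range(8)]
    h = solve_linear(ata, atb) + [1.0]
    return [h[0:3], h[3:6], h[6:9]]


def apply_homography(h, point):
    """Map one 2D point through the homography h."""
    x, y = point
    w = h[2][0] * x + h[2][1] * y + h[2][2]
    u = (h[0][0] * x + h[0][1] * y + h[0][2]) / w
    v = (h[1][0] * x + h[1][1] * y + h[1][2]) / w
    return (u, v)


def mean_registration_error(h, src, dst):
    """Mean Euclidean distance between mapped src points and dst points."""
    total = 0.0
    for p, (u, v) in zip(src, dst):
        pu, pv = apply_homography(h, p)
        total += math.hypot(pu - u, pv - v)
    return total / len(src)
```

In this sketch, `src` would hold the six fiducial positions in the virtual endoscopy image and `dst` the same fiducials detected in the microscope frame; `mean_registration_error` then gives a residual in the units of `dst`. Real pixel coordinates are usually normalized before solving to improve numerical conditioning, a step omitted here for brevity.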