Robotic interventions in hazardous scenarios need to pay special attention to safety, as in most cases it is necessary to have an expert operator in the loop. Moreover, the use of a multi-modal Human-Robot Interface allows the user to interact with the robot using manual control in critical steps, as well as semi-autonomous behaviours in more secure scenarios, by using, for example, object tracking and recognition techniques. This paper describes a novel vision system to track and estimate the depth of metallic targets for robotic interventions. The system has been designed for on-hand monocular cameras, focusing on solving lack of visibility and partial occlusions. This solution has been validated during real interventions at the Centre for Nuclear Research (CERN) accelerator facilities, achieving 95% success in autonomous mode and 100% in a supervised manner. The system increases the safety and efficiency of the robotic operations, reducing the cognitive fatigue of the operator during non-critical mission phases. The integration of such an assistance system is especially important when facing complex (or repetitive) tasks, in order to reduce the work load and accumulated stress of the operator, enhancing the performance and safety of the mission.