In this study, a multi-level scale stabilizer intended for visual odometry (MLSS-VO) combined with a self-supervised feature matching method is proposed to address the scale uncertainty and scale drift encountered in the field of monocular visual odometry. Firstly, the architecture of an instance-level recognition model is adopted to propose a feature matching model based on a Siamese neural network. Combined with the traditional approach to feature point extraction, the feature baselines on different levels are extracted, and then treated as a reference for estimating the motion scale of the camera. On this basis, the size of the target in the tracking task is taken as the top-level feature baseline, while the motion matrix parameters as obtained by the original visual odometry of the feature point method are used to solve the real motion scale of the current frame. The multi-level feature baselines are solved to update the motion scale while reducing the scale drift. Finally, the spatial target localization algorithm and the MLSS-VO are applied to propose a framework intended for the tracking of target on the mobile platform. According to the experimental results, the root mean square error (RMSE) of localization is less than 3.87 cm, and the RMSE of target tracking is less than 4.97 cm, which demonstrates that the MLSS-VO method based on the target tracking scene is effective in resolving scale uncertainty and restricting scale drift, so as to ensure the spatial positioning and tracking of the target.