In highly dynamic environments, visual SLAM can incur large errors when solving the camera pose between two frames, which in turn degrades the overall localisation accuracy of the system. To address this, this paper optimizes the traditional visual SLAM algorithm, focusing on its tracking part. For images acquired in highly dynamic environments, features are extracted through a target detection network that uses a multi-scale channel attention module (MS-CAB) and an attention feature fusion module (AFFB). The GC-RANSAC algorithm is then used to distinguish and remove dynamic feature points, and the remaining static feature points are used for pose estimation. Experiments show that the proposed algorithm has lower trajectory error in highly dynamic environments: it improves accuracy by more than 94% over ORB-SLAM2 in the dynamic Walking scenarios, improves localisation accuracy by 90.0%, 86.7%, 98.8%, and 97.5% over ORB-SLAM3, and takes only 30.08% of the time required by DS-SLAM. These experimental findings validate the effectiveness of the proposed approach.
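The step of separating dynamic from static feature points can be illustrated with a minimal sketch. It is not the GC-RANSAC procedure itself: it assumes a fundamental matrix `F` between the two frames has already been estimated (for example by a robust estimator such as GC-RANSAC) and simply thresholds each match's distance to its epipolar line. The names `epipolar_distance`, `split_static_dynamic`, and the `threshold` value are hypothetical and chosen only for illustration.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float]
Match = Tuple[Point, Point]          # (point in previous frame, point in current frame)
Matrix3 = Sequence[Sequence[float]]  # 3x3 fundamental matrix, row-major


def epipolar_distance(F: Matrix3, p_prev: Point, p_curr: Point) -> float:
    """Pixel distance from p_curr to the epipolar line F @ [p_prev, 1]."""
    x, y = p_prev
    # Epipolar line in the current frame: a*u + b*v + c = 0
    a, b, c = (row[0] * x + row[1] * y + row[2] for row in F)
    norm = math.hypot(a, b)
    if norm == 0.0:
        return float("inf")  # degenerate line: treat the match as inconsistent
    u, v = p_curr
    return abs(a * u + b * v + c) / norm


def split_static_dynamic(
    F: Matrix3, matches: List[Match], threshold: float = 1.0
) -> Tuple[List[Match], List[Match]]:
    """Split matches into (static, dynamic) by epipolar consistency.

    Assumption: a match that violates the epipolar constraint by more than
    `threshold` pixels belongs to a moving object. This stands in for the
    inlier/outlier decision of a robust estimator; it is not GC-RANSAC.
    """
    static: List[Match] = []
    dynamic: List[Match] = []
    for p_prev, p_curr in matches:
        dist = epipolar_distance(F, p_prev, p_curr)
        (static if dist <= threshold else dynamic).append((p_prev, p_curr))
    return static, dynamic


if __name__ == "__main__":
    # Toy case: pure sideways camera translation, so epipolar lines are horizontal.
    F = [[0.0, 0.0, 0.0], [0.0, 0.0, -1.0], [0.0, 1.0, 0.0]]
    matches = [((10.0, 20.0), (15.0, 20.0)),   # stays on its row
               ((30.0, 40.0), (32.0, 48.0))]   # leaves its row
    static, dynamic = split_static_dynamic(F, matches)
    print(len(static), len(dynamic))
```

Only the matches returned in the first list would then be passed on to pose estimation; the second list corresponds to the feature points that are discarded.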