Simultaneous Localization and Mapping (SLAM) is a critical technology for accurate robot localization and path planning, and improving localization accuracy remains an important research problem. In this paper, we propose a Transformer-based visual semantic SLAM algorithm (DDETR-SLAM) to address the shortcomings of traditional visual SLAM frameworks, such as large localization errors in dynamic scenes and “ghosting” in 3D mapping. First, by incorporating the Deformable DETR (DEtection TRansformer) network as an object detection thread, the system achieves higher pose estimation accuracy than ORB-SLAM2. Second, a dynamic feature point culling algorithm that exploits the detector's semantic information is designed to eliminate outlier points generated by dynamic objects, thereby improving the accuracy and robustness of SLAM localization and mapping. Experiments are conducted on the public TUM dataset to verify the localization accuracy, computational efficiency, and readability of the point cloud maps produced by DDETR-SLAM. The results show that in highly dynamic environments, the ATE (Absolute Trajectory Error), translation error, and rotation error are reduced by 98.45%, 95.34%, and 92.67%, respectively, compared to ORB-SLAM2. In most cases, the proposed system outperforms DS-SLAM, DynaSLAM, Detect-SLAM, RGB-D SLAM, and YOLOv5 + ORB-SLAM2 in localization accuracy, and the dense maps it builds are more readable. The RPE (Relative Pose Error) is only 0.0076 m, and the ATE is only 0.0063 m.
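To make the culling step concrete, the following Python sketch illustrates the general idea of semantic dynamic-feature removal: ORB feature points that fall inside detector-reported bounding boxes of a priori dynamic classes (e.g., people) are discarded before pose estimation. This is a minimal illustration under assumed interfaces, not the paper's implementation; the detection tuple format, class list, and function names are hypothetical stand-ins for the Deformable DETR thread's output.

```python
import cv2
import numpy as np

# Assumed set of a priori dynamic classes; the paper's detector is Deformable DETR.
DYNAMIC_CLASSES = {"person", "dog", "car"}

def cull_dynamic_features(gray_image, detections):
    """Detect ORB features and drop those inside dynamic-object boxes.

    detections: list of (class_name, (x_min, y_min, x_max, y_max)) tuples,
    a hypothetical format standing in for the detection thread's output.
    """
    orb = cv2.ORB_create(nfeatures=1000)
    keypoints, descriptors = orb.detectAndCompute(gray_image, None)
    if descriptors is None:
        return [], None

    # Boxes belonging to classes presumed to move.
    dynamic_boxes = [box for cls, box in detections if cls in DYNAMIC_CLASSES]

    kept_kps, kept_desc = [], []
    for kp, desc in zip(keypoints, descriptors):
        x, y = kp.pt
        inside = any(x0 <= x <= x1 and y0 <= y <= y1
                     for (x0, y0, x1, y1) in dynamic_boxes)
        if not inside:  # keep only features on the presumed-static background
            kept_kps.append(kp)
            kept_desc.append(desc)

    return kept_kps, np.array(kept_desc) if kept_desc else None
```

In a full system, detection would run in a parallel thread alongside tracking, and box-level culling would typically be refined with finer-grained checks (e.g., geometric consistency across frames), since bounding boxes also cover static background pixels.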