Currently, most robots equipped with visual simultaneous localization and mapping (SLAM) systems perform well in static environments. However, practical scenarios often contain dynamic objects, rendering the environment less than entirely "static." Diverse dynamic objects in the environment pose substantial challenges to the precision of visual SLAM systems. To address this challenge, we propose a real‐time visual‐inertial SLAM system that extensively leverages objects within the environment. First, we reject image regions corresponding to dynamic objects. Geometric constraints are then applied within the stationary object regions to refine the static‐area mask, facilitating the extraction of more stable feature points. Second, static landmarks are constructed from the static regions, and a spatiotemporal factor graph is built by combining the temporal information from the Inertial Measurement Unit (IMU) with the semantic information from the static landmarks. Finally, we validate the proposed system through a diverse set of experiments, encompassing challenging scenarios from publicly available benchmarks and the real world, and compare it against state‐of‐the‐art approaches. In particular, our system achieves a more than 40% accuracy improvement over the baseline method on these datasets. The results demonstrate that the proposed method exhibits outstanding robustness and accuracy not only in complex dynamic environments but also in static environments.
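The two-stage static-mask refinement described above (semantic rejection of dynamic-object regions, followed by a geometric-consistency check) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names, the per-pixel reprojection-error input, and the 2-pixel threshold are all assumptions chosen for clarity.

```python
import numpy as np

def refine_static_mask(semantic_dynamic_mask, reproj_error, error_thresh=2.0):
    """Combine a semantic dynamic-object mask with a per-pixel geometric
    check to produce a refined static-area mask (hypothetical sketch).

    semantic_dynamic_mask : bool array, True where a detector flagged a dynamic object
    reproj_error          : float array, per-pixel reprojection error in pixels
    error_thresh          : pixels; assumed threshold for geometric consistency
    """
    geometric_static = reproj_error < error_thresh      # geometrically consistent pixels
    return (~semantic_dynamic_mask) & geometric_static  # static = not dynamic AND consistent

def filter_static_features(keypoints, static_mask):
    """Keep only feature points that fall inside the static mask.

    keypoints : (N, 2) integer array of (row, col) pixel coordinates
    """
    rows, cols = keypoints[:, 0], keypoints[:, 1]
    return keypoints[static_mask[rows, cols]]

# Toy usage: one semantically dynamic pixel, one geometrically inconsistent pixel.
semantic = np.zeros((4, 4), dtype=bool)
semantic[0, 0] = True            # flagged by the object detector
error = np.ones((4, 4))
error[1, 1] = 5.0                # large reprojection error -> treated as moving
mask = refine_static_mask(semantic, error)
features = np.array([[0, 0], [1, 1], [2, 2]])
stable = filter_static_features(features, mask)  # only [2, 2] survives
```

In this sketch, feature points surviving both filters would then seed the static landmarks used in the spatiotemporal factor graph.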