The fusion of monocular visual and inertial cues has become popular in robotics, unmanned vehicles and augmented reality fields. Recent results have shown that optimization-based fusion strategies outperform filtering strategies. Robust state estimation is the core capability for optimization-based visual–inertial Simultaneous Localization and Mapping (SLAM) systems. As a result of the nonlinearity of visual–inertial systems, the performance heavily relies on the accuracy of initial values (visual scale, gravity, velocity and Inertial Measurement Unit (IMU) biases). Therefore, this paper aims to propose a more accurate initial state estimation method. On the basis of the known gravity magnitude, we propose an approach to refine the estimated gravity vector by optimizing the two-dimensional (2D) error state on its tangent space, then estimate the accelerometer bias separately, which is difficult to be distinguished under small rotation. Additionally, we propose an automatic termination criterion to determine when the initialization is successful. Once the initial state estimation converges, the initial estimated values are used to launch the nonlinear tightly coupled visual–inertial SLAM system. We have tested our approaches with the public EuRoC dataset. Experimental results show that the proposed methods can achieve good initial state estimation, the gravity refinement approach is able to efficiently speed up the convergence process of the estimated gravity vector, and the termination criterion performs well.