The emerging event cameras are bio-inspired sensors that can output pixel-level brightness changes at extremely high rates, and event-based visual-inertial odometry (VIO) is widely studied and used in autonomous robots. In this paper, we propose an event-based stereo VIO system, namely ESVIO. Firstly, we present a novel direct event-based VIO method, which fuses events’ depth, Time-Surface images, and pre-integrated inertial measurement to estimate the camera motion and inertial measurement unit (IMU) biases in a sliding window non-linear optimization framework, effectively improving the state estimation accuracy and robustness. Secondly, we design an event-inertia semi-joint initialization method, through two steps of event-only initialization and event-inertia initial optimization, to rapidly and accurately solve the initialization parameters of the VIO system, thereby further improving the state estimation accuracy. Based on these two methods, we implement the ESVIO system and evaluate the effectiveness and robustness of ESVIO on various public datasets. The experimental results show that ESVIO achieves good performance in both accuracy and robustness when compared with other state-of-the-art event-based VIO and stereo visual odometry (VO) systems, and, at the same time, with no compromise to real-time performance.