Text tracking aims to track multiple text instances in a video and to construct a trajectory for each of them. Existing methods tackle this task with the tracking-by-detection framework, i.e., detecting the text instances in each frame and associating the corresponding instances across consecutive frames. We argue that the tracking accuracy of this paradigm is severely limited in complex scenarios: missed detections caused by, e.g., motion blur break the text trajectories, and different text instances with similar appearance are easily confused, leading to incorrect associations. To this end, we propose a novel spatio-temporal complementary text tracking model. We leverage a Siamese Complementary Module to fully exploit the temporal continuity of text instances, which effectively alleviates missed detections and hence preserves the completeness of each text trajectory. We further integrate the semantic cues and the visual cues of each text instance into a unified representation via a text similarity learning network, which provides high discriminative power for text instances with similar appearance and thus avoids misassociation between them. Our method achieves state-of-the-art performance on several public benchmarks. The source code is available at https://github.com/lsabrinax/VideoTextSCM.
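
To illustrate the idea of fusing visual and semantic cues into one embedding used for cross-frame association, here is a minimal sketch. It is not the authors' implementation; the module name, feature dimensions, and fusion layers are illustrative assumptions.

```python
# Illustrative sketch only: project visual and semantic features of a text
# instance into a shared space, fuse them, and score candidate matches across
# frames with cosine similarity. Dimensions and layer choices are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class UnifiedTextEmbedding(nn.Module):
    """Fuse visual and semantic cues of a text instance into one embedding."""

    def __init__(self, visual_dim=256, semantic_dim=128, embed_dim=128):
        super().__init__()
        self.visual_proj = nn.Linear(visual_dim, embed_dim)
        self.semantic_proj = nn.Linear(semantic_dim, embed_dim)
        self.fuse = nn.Linear(2 * embed_dim, embed_dim)

    def forward(self, visual_feat, semantic_feat):
        v = F.relu(self.visual_proj(visual_feat))
        s = F.relu(self.semantic_proj(semantic_feat))
        fused = self.fuse(torch.cat([v, s], dim=-1))
        # L2-normalize so a dot product gives cosine similarity.
        return F.normalize(fused, dim=-1)


if __name__ == "__main__":
    model = UnifiedTextEmbedding()
    # Toy features: 4 detections in frame t, 5 detections in frame t+1.
    emb_t = model(torch.randn(4, 256), torch.randn(4, 128))
    emb_t1 = model(torch.randn(5, 256), torch.randn(5, 128))
    # Similarity matrix used to associate instances across the two frames.
    similarity = emb_t @ emb_t1.t()
    print(similarity.shape)  # torch.Size([4, 5])
```

In this sketch, the resulting similarity matrix would feed a standard assignment step (e.g., Hungarian matching) to link detections between frames; the paper's actual association procedure may differ.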