Diversity of training data significantly affects tracking robustness of model under unconstrained environments. However, existing labeled datasets for facial landmark tracking tend to be large but not diverse, and manually annotating the massive clips of new diverse videos is extremely expensive. To address these problems, we propose a Spatial-Temporal Knowledge Integration (STKI) approach. Unlike most existing methods which rely heavily on labeled data, STKI exploits supervisions from unlabeled data. Specifically, STKI integrates spatial-temporal knowledge from massive unlabeled videos, which has several orders of magnitude more than existing labeled video data on the diversity, for robust tracking. Our framework includes a self-supervised tracker and an image-based detector for tracking initialization. To avoid the distortion of facial shape, the tracker leverages adversarial learning to introduce facial structure prior and temporal knowledge into cycle-consistency tracking. Meanwhile, we design a graph-based knowledge distillation method, which distills the knowledge from tracking and detection results, to improve the generalization of the detector. The fine-tuned detector can provide tracker on unconstrained videos with high-quality tracking initialization. Extensive experimental results show that the proposed method achieves state-of-the-art performance on comprehensive evaluation datasets. CCS CONCEPTS • Computing methodologies → Tracking.