Integrating temporal and spatial continuity information into the design of video texture descriptors is crucial for video face recognition, and for video analysis and understanding more broadly, yet it has not been properly addressed. In this paper, a novel video face recognition algorithm is proposed based on an aggregated local spatial-temporal descriptor (ST-VLAD), combined with a novel Fisher criterion-based weight-learning method that captures the local information of a video more accurately and thereby substantially improves the representational power of the description vectors. The proposed descriptor was evaluated on two representative databases, Honda/UCSD and YouTube Faces, achieving accuracies of 89.7% and 87.3%, respectively. The proposed method clearly outperformed existing state-of-the-art methods, suggesting broad potential utility in video face recognition.
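For orientation, ST-VLAD builds on the standard VLAD aggregation scheme (Jégou et al.), in which local descriptors are hard-assigned to codebook centers and their residuals are accumulated and normalized. The sketch below illustrates only that generic aggregation step; the function name `encode_vlad`, the codebook construction, and the normalization choices are illustrative assumptions, not the authors' ST-VLAD implementation or its Fisher criterion weighting.

```python
import numpy as np

def encode_vlad(descriptors, codebook):
    """Aggregate local descriptors into a single VLAD vector.

    descriptors: (N, D) array of local spatial-temporal descriptors
                 extracted from a video (one row per local patch/cuboid).
    codebook:    (K, D) array of cluster centers learned offline,
                 e.g. via k-means on training descriptors.
    Returns a normalized VLAD vector of length K * D.
    """
    # Hard-assign each descriptor to its nearest codebook center.
    dists = np.linalg.norm(descriptors[:, None, :] - codebook[None, :, :], axis=2)
    assignments = np.argmin(dists, axis=1)

    # Accumulate residuals (descriptor minus center) for each center.
    K, D = codebook.shape
    vlad = np.zeros((K, D))
    for k in range(K):
        members = descriptors[assignments == k]
        if len(members):
            vlad[k] = (members - codebook[k]).sum(axis=0)

    # Signed square-root followed by L2 normalization, as is standard for VLAD.
    vlad = vlad.ravel()
    vlad = np.sign(vlad) * np.sqrt(np.abs(vlad))
    norm = np.linalg.norm(vlad)
    return vlad / norm if norm > 0 else vlad
```

Two videos can then be compared by the similarity (e.g., cosine) of their VLAD vectors; the paper's contribution lies in how the local spatial-temporal descriptors are formed and weighted before this aggregation.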