Systems for still-to-video face recognition (FR) seek to detect the presence of target individuals based on reference facial still images or mug-shots. These systems face several challenges in video surveillance applications due to variations in capture conditions (e.g., pose, scale, illumination, blur, and expression) and to camera interoperability. Beyond these issues, only a few reference stills are typically available during enrollment to design representative facial models of the target individuals. Systems for still-to-video FR must therefore rely on adaptation, multiple face representations, or synthetic generation of reference stills to enhance the intra-class variability of face models. Moreover, many FR systems match only high-quality faces captured in video, which further reduces the probability of detecting target individuals. Instead of matching faces captured through segmentation against reference stills, this paper exploits Adaptive Appearance Model Tracking (AAMT) to gradually learn a track-face-model for each individual appearing in the scene. The Sequential Karhunen-Loève technique is used for online learning of these track-face-models within a particle-filter-based face tracker. Meanwhile, these models are matched over successive frames against the reference still images of each target individual enrolled in the system, and the matching scores are accumulated over several frames for robust spatiotemporal recognition. A target individual is recognized if the scores accumulated for a track-face-model over a fixed time window surpass a decision threshold. The main advantage of AAMT over traditional still-to-video FR systems is the greater diversity of facial appearances that may be captured during operations, which can lead to better discrimination for spatiotemporal recognition. Compared with state-of-the-art adaptive biometric systems, the proposed method selects facial captures to update an individual's face model more reliably because it relies on information from tracking. Simulation results obtained with the Chokepoint video dataset indicate that the proposed method provides a significantly higher level of performance than state-of-the-art systems when a single reference still per individual is available for matching. This higher level of performance is achieved when the diverse facial appearances captured in video through AAMT correspond to those of the reference stills.
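To make the online learning step concrete, the following is a minimal sketch of how a track-face-model could be updated incrementally from tracked face patches. It uses scikit-learn's IncrementalPCA as a stand-in for the Sequential Karhunen-Loève update (the two are closely related incremental subspace methods, but this is not the paper's implementation). All parameters here, including the number of components, the patch size, and the reconstruction-error matching score, are illustrative assumptions rather than values from the paper.

```python
# Sketch: online learning of a per-track subspace face model, with
# scikit-learn's IncrementalPCA standing in for the Sequential
# Karhunen-Loève (SKL) update. n_components and the patch geometry
# are assumed for illustration only.
import numpy as np
from sklearn.decomposition import IncrementalPCA

N_COMPONENTS = 16  # subspace dimensionality (assumed)

model = IncrementalPCA(n_components=N_COMPONENTS)

def update_track_face_model(model, face_patches):
    """Fold a batch of tracked face patches into the subspace model.

    Note: each partial_fit batch must contain at least n_components
    patches, so frames are buffered into batches before updating.
    """
    X = np.asarray(face_patches, dtype=np.float64).reshape(len(face_patches), -1)
    model.partial_fit(X)  # incremental eigenbasis update, SKL-style
    return model

def matching_score(model, reference_still):
    """Frame-level score between a reference still and the track-face-model:
    negative reconstruction error in the learned subspace (assumed metric)."""
    x = np.asarray(reference_still, dtype=np.float64).reshape(1, -1)
    x_hat = model.inverse_transform(model.transform(x))
    return -float(np.linalg.norm(x - x_hat))
```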
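The spatiotemporal decision rule described above, accumulating frame-level matching scores per track over a fixed window and declaring an identity once a threshold is surpassed, could be sketched as follows. The window length, threshold value, and class structure are hypothetical choices for illustration, not the paper's configuration.

```python
# Sketch: accumulate per-frame matching scores for each
# (track-face-model, enrolled target) pair over a sliding window of
# frames; recognize a target when the accumulated score exceeds a
# decision threshold. WINDOW and THRESHOLD are assumed values.
from collections import defaultdict, deque

WINDOW = 30        # number of frames over which scores accumulate (assumed)
THRESHOLD = 12.0   # decision threshold on the accumulated score (assumed)

class SpatiotemporalRecognizer:
    def __init__(self, window=WINDOW, threshold=THRESHOLD):
        self.window = window
        self.threshold = threshold
        # Per (track_id, target_id): sliding window of frame-level scores.
        self.scores = defaultdict(lambda: deque(maxlen=window))

    def update(self, track_id, frame_scores):
        """frame_scores maps each enrolled target_id to this frame's
        matching score for the given track. Returns the recognized
        target_id once its accumulated score surpasses the threshold,
        or None if no target qualifies yet."""
        for target_id, s in frame_scores.items():
            self.scores[(track_id, target_id)].append(s)
        best_id, best_sum = None, self.threshold
        for target_id in frame_scores:
            acc = sum(self.scores[(track_id, target_id)])
            if acc > best_sum:
                best_id, best_sum = target_id, acc
        return best_id
```

Accumulating over a window rather than deciding per frame is what makes the recognition spatiotemporal: isolated high-scoring frames (e.g., from motion blur or a chance resemblance) are unlikely to surpass the threshold unless the evidence persists across many frames of a track.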