opposed to performing "blind encoding" of pixel values or data blocks. This assumption was at the core of the concept of model-assisted coding of video teleconferencing sequences described in [1,2], in which finer quantization (requiring the allocation of a higher coding rate) is performed in previously identified areas of interest, such as human faces.

The validity of this assumption is further demonstrated in the work described in this paper. Here, additional knowledge about sequence content in the form of global background motion estimation is obtained and used to perform background removal for more robust tracking of human faces and bodies. The added robustness is especially significant in cases of complex spatio-temporal scene backgrounds, which would typically occur with video data acquired from a hand-held video camera (e.g., in a mobile situation).

Many foreground/background segmentation techniques have been proposed in the literature, initially in the relatively simple case of stationary (still) backgrounds [3,4], then in the more general case of moving backgrounds [5-9]. In the former case, the techniques are typically based on frame differencing, followed by thresh-

We describe a system which detects and tracks types of objects of interest specified a priori, such as human faces and bodies, in video sequences. Face location tracking algorithms described in previous documents are extended to enable robust and accurate tracking of faces and bodies in video scenes with complex spatio-temporal backgrounds, i.e. cluttered static backgrounds and moving backgrounds (for example, due to camera motion or zoom). The new tracking algorithm includes background removal obtained from global motion estimation (GME), as well as the use of combined motion and edge data and knowledge-based temporal adaptation, which jointly add significant robustness to the tracking. For typical "head-and-shoulders" video material with up to two persons in the scene and a still background, an additional 24% of successful tracking is achieved by the proposed algorithm, bringing the average success rate to about 96%. For more complex material with moving backgrounds, successful face and body tracking is achieved at an average rate of about 86%, whereas an algorithm which does not perform background removal could only achieve less than 10% of successful tracking. Initial coding experiments using the information obtained from face tracking for model-assisted coding of video in QCIF format at 16 kbps demonstrate