of data captured over a network of cameras for various recognition tasks. In order to limit human labour and error, this paper presents a spatial-temporal fusion approach to accurately combine information from Region of Interest (RoI) batches captured in a multi-camera surveillance scenario. Feature-level and score-level approaches are proposed for spatial-temporal fusion of information over frames, within a framework based on ensembles of GMM-UBMs (Gaussian Mixture Model - Universal Background Models). At the feature level, the features from a batch of multiple frames are combined and fed to the ensemble, whereas at the score level the outcomes of the ensemble for individual frames are combined. Results indicate that feature-level fusion provides a higher level of accuracy in a very efficient way.
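To make the distinction between the two fusion strategies concrete, the following is a minimal sketch (not the authors' implementation) contrasting feature-level and score-level fusion of a batch of per-frame RoI feature vectors scored against a GMM-UBM-style model. The names `ubm` and `target_gmm`, the synthetic data, and the choice of mean pooling and mean score combination are illustrative assumptions, not details taken from the paper.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# Toy batch: 10 frames of the same RoI, each described by a 32-dimensional feature vector.
batch = rng.normal(size=(10, 32))

# Stand-ins for a Universal Background Model and a target-adapted model;
# both are fit on synthetic data here purely so the snippet runs end to end.
ubm = GaussianMixture(n_components=4, random_state=0).fit(rng.normal(size=(200, 32)))
target_gmm = GaussianMixture(n_components=4, random_state=0).fit(rng.normal(loc=0.5, size=(200, 32)))

def llr(x):
    """Log-likelihood ratio of the target model vs. the UBM for feature rows x."""
    return target_gmm.score_samples(x) - ubm.score_samples(x)

# Feature-level fusion: pool the batch into a single feature vector (mean here), score once.
pooled = batch.mean(axis=0, keepdims=True)
feature_level_score = llr(pooled)[0]

# Score-level fusion: score every frame separately, then combine the per-frame scores (mean here).
score_level_score = llr(batch).mean()

print(feature_level_score, score_level_score)
```

Feature-level fusion scores each batch with a single model evaluation, whereas score-level fusion requires one evaluation per frame, which is one reason the former can be the more efficient option.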
Introduction

Video surveillance applications, such as activity recognition, are increasingly making use of multiple sensors and modalities. The fusion of multiple diverse sources of information is expected to benefit the recognition of objects, persons, activities and events captured by an array of cameras.

Networks of video cameras are commonly employed to monitor large areas for a variety of applications. A central issue in such networks is the tracking and recognition of individuals of interest across multiple cameras. These individuals must be recognized when leaving the Field of View (FoV) of one camera and re-identified when entering the FoV of another camera. Systems for video-to-video recognition are typically employed for person re-identification (PR). In a FoV, the appearance of an individual may be captured in reference RoIs, and representative models may be learned from RoI trajectories. Then, a probe RoI may be matched against the reference model either live (real-time monitoring) or from archived footage (post-event analysis) [1]. In this paper, we address a PR system over a wide network of cameras where no target individual is enrolled in the system in advance.

In such environments, where objects move and cross the FoVs of multiple cameras, multiple streams of the same individual's RoI are likely to be recorded, starting at different points in time and with various lengths (see Fig. 1a). The surveillance system must track that person across all cameras whose FoVs overlap the person's path.