Abstract. Recently, several methods for background subtraction from a moving camera have been proposed. They use bottom-up cues to segment video frames into foreground and background regions. Because they lack explicit models, they can easily fail to detect a foreground object when such cues are ambiguous in parts of the video. This becomes even more challenging when videos must be processed online. We present a method that learns pixel-based models for the foreground and background regions and, in addition, segments each frame in an online framework. The method uses long-term trajectories together with a Bayesian filtering framework to estimate motion and appearance models. We compare our method to previous approaches and show results on challenging video sequences.
Introduction

One may argue that the ultimate goal of computer vision is to learn and perceive the environment the way children do. Without access to presegmented visual input, infants learn to segment objects from the background using low-level cues. Inspired by this evidence, significant effort in the computer vision community has focused on bottom-up segmentation of images and videos. This has become ever more important with the proliferation of videos captured by moving cameras. Our goal is to develop an algorithm for foreground/background segmentation from a freely moving camera in an online framework that can handle arbitrarily long sequences.

Traditional video segmentation comes in different flavors depending on the application, but falls short of achieving this goal. In background subtraction, moving foreground objects are segmented by learning a model of the background under the assumption of a static scene and camera. Motion segmentation methods attempt to segment sparse point trajectories based on the coherency of their motion; however, they lack a model of the appearance of the foreground or background. Video object segmentation attempts to segment an object of interest from the video with no model of the scene background. On the other hand, several segmentation techniques attempt to extend traditional image segmentation to the temporal domain. Such techniques are typically limited to segmenting a short window of time.

Low-level cues are frequently ambiguous if one considers only a short window of frames. Existing approaches either ignore this problem or resort to processing the whole video offline. Offline methods can typically produce good results on short sequences, but their complexity grows rapidly as more frames need to be processed. The key to solving this problem is to recognize that handling long sequences online requires learning and maintaining models for the background and foreground regions. Such models compactly accumulate evidence over a large number of frames and are essential for high-level vision tasks.

The contribution of this paper is a novel online method that learns appearance and motion models of the scene (background ...
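To make the idea of recursively accumulating per-pixel evidence concrete, the following Python sketch shows a toy per-pixel Bayesian filter with constant memory per frame. It is a hypothetical illustration, not the paper's algorithm: the PixelModel class, its histogram appearance models, the stay persistence parameter, and the hard-assignment update are all assumptions introduced here for exposition.

import numpy as np

class PixelModel:
    """Toy per-pixel filter: running appearance histograms plus a label posterior."""

    def __init__(self, n_bins=16, prior_fg=0.5):
        self.n_bins = n_bins
        # Running intensity histograms (appearance models), one per class;
        # initialized to ones to avoid zero likelihoods.
        self.hist = {"fg": np.ones(n_bins), "bg": np.ones(n_bins)}
        self.p_fg = prior_fg  # current belief that this pixel is foreground

    def _bin(self, intensity):
        # Map a normalized intensity in [0, 1] to a histogram bin.
        return min(int(intensity * self.n_bins), self.n_bins - 1)

    def _likelihood(self, label, intensity):
        h = self.hist[label]
        return h[self._bin(intensity)] / h.sum()

    def update(self, intensity, stay=0.9):
        # Predict: two-state transition model expressing label persistence.
        p_fg = stay * self.p_fg + (1 - stay) * (1 - self.p_fg)
        # Correct: Bayes rule with the per-class appearance likelihoods.
        lf = p_fg * self._likelihood("fg", intensity)
        lb = (1 - p_fg) * self._likelihood("bg", intensity)
        self.p_fg = lf / (lf + lb)
        # Accumulate evidence into the appearance model of the likely class,
        # so the models improve without storing past frames.
        label = "fg" if self.p_fg > 0.5 else "bg"
        self.hist[label][self._bin(intensity)] += 1
        return self.p_fg

# Usage: feed one normalized intensity per frame; memory stays constant
# no matter how long the sequence runs.
pm = PixelModel()
for intensity in [0.2, 0.21, 0.8, 0.19, 0.2]:
    print(round(pm.update(intensity), 3))

The design point this sketch makes is the one argued above: because evidence is folded into compact per-pixel models at every frame, an online method never needs to revisit earlier frames, which is what makes arbitrarily long sequences tractable.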