Abstract. We describe an approach to incorporate scene topology and semantics into pixel-level object detection and localization. Our method requires video to determine occlusion regions and thence local depth ordering, and any visual recognition scheme that provides a score at local image regions, for instance object detection probabilities. We set up a cost functional that incorporates occlusion cues induced by object boundaries, label consistency and recognition priors, and solve it using a convex optimizat…
“…They usually do not use high-level object recognizers or try to improve optical flow. Taylor et al [51] incorporate object detections and use temporal information to reason about occlusions to improve their segmentation results, but do not compute optical flow. Lalos et al [30] compute optical flow for an object of interest using a tracking-by-detection approach.…”
Existing optical flow methods make generic, spatially homogeneous assumptions about the spatial structure of the flow. In reality, optical flow varies across an image depending on object class. Simply put, different objects move differently. Here we exploit recent advances in static semantic scene segmentation to segment the image into objects of different types. We define different models of image motion in these regions depending on the type of object. For example, we model the motion on roads with homographies, vegetation with spatially smooth flow, and independently moving objects like cars and planes with affine motion plus deviations. We then pose the flow estimation problem using a novel formulation of localized layers, which addresses limitations of traditional layered models for dealing with complex scene motion. Our semantic flow method achieves the lowest error of any published monocular method in the KITTI-2015 flow benchmark and produces qualitatively better flow and segmentation than recent top methods on a wide range of natural videos.
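As a rough illustration of the class-dependent motion models described above (a minimal sketch, not the paper's implementation; the function names are hypothetical), a parametric motion model can be fit to each semantic region by least squares on point correspondences — here the affine model used for independently moving objects such as cars:

```python
import numpy as np

def fit_affine(p0, p1):
    """Least-squares affine motion model: p1 ≈ A @ p0 + t.

    p0, p1: (N, 2) arrays of corresponding points in frames t and t+1.
    Returns the 2x3 parameter matrix [A | t].
    """
    n = p0.shape[0]
    X = np.hstack([p0, np.ones((n, 1))])        # (N, 3) design matrix
    # Solve X @ M = p1 in the least-squares sense; M is (3, 2).
    M, *_ = np.linalg.lstsq(X, p1, rcond=None)
    return M.T                                   # (2, 3)

def affine_flow(points, M):
    """Flow implied by the fitted affine model at the given points."""
    n = points.shape[0]
    X = np.hstack([points, np.ones((n, 1))])
    return X @ M.T - points                      # per-point displacement

# Toy example: a region undergoing a pure translation of (5, -2).
p0 = np.array([[0., 0.], [10., 0.], [0., 10.], [10., 10.]])
p1 = p0 + np.array([5., -2.])
M = fit_affine(p0, p1)
print(np.allclose(affine_flow(p0, M), [5., -2.]))  # True
```

A homography (for roads) or a smoothness-regularized dense field (for vegetation) would replace the affine fit in the corresponding regions; the per-region model choice is exactly what the semantic segmentation supplies.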
“…However, they do not model multiple semantic object classes, nor do they capture contextual relations between objects and background classes. Taylor et al [21] jointly infer pixel semantic classes and occlusion relationship in video segmentation. Unlike our method, they do not incorporate object instance level reasoning.…”
We tackle the problem of semantic segmentation of dynamic scenes in video sequences. We propose to incorporate foreground object information into pixel labeling by jointly reasoning about the semantic labels of supervoxels, object instance tracks, and geometric relations between objects. We take an exemplar approach to object modeling by using a small set of object annotations and exploiting the temporal consistency of object motion. After generating a set of moving-object hypotheses, we design a CRF framework that jointly models the supervoxels and object instances. The optimal semantic labeling is inferred by MAP estimation of the model, which is solved by a single move-making-based optimization procedure. We demonstrate the effectiveness of our method on three public datasets and show that our model can achieve results superior or comparable to the state-of-the-art with less object-level supervision.
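The MAP inference step above can be illustrated on a toy pairwise CRF. This is a minimal sketch, not the paper's move-making procedure: it uses iterated conditional modes (ICM) as a simpler stand-in optimizer, with unary label costs and a Potts smoothness term between neighboring nodes standing in for supervoxel adjacency:

```python
import numpy as np

def icm_map(unary, edges, potts_weight, n_iters=10):
    """Approximate MAP labeling of a pairwise CRF by iterated
    conditional modes (a simple stand-in for move-making inference).

    unary:        (n_nodes, n_labels) cost of assigning each label.
    edges:        list of (i, j) neighbor pairs.
    potts_weight: penalty paid whenever neighbors take different labels.
    """
    n_nodes, n_labels = unary.shape
    labels = unary.argmin(axis=1)                # initialize from unaries
    nbrs = {i: [] for i in range(n_nodes)}
    for i, j in edges:
        nbrs[i].append(j)
        nbrs[j].append(i)
    for _ in range(n_iters):
        for i in range(n_nodes):
            cost = unary[i].copy()
            for j in nbrs[i]:                    # add disagreement penalties
                cost += potts_weight * (np.arange(n_labels) != labels[j])
            labels[i] = cost.argmin()            # greedy per-node update
    return labels

# Toy example: a 4-node chain; node 2 has a noisy unary that the
# smoothness term overrides.
unary = np.array([[0., 5.], [0., 5.], [2., 1.], [0., 5.]])
edges = [(0, 1), (1, 2), (2, 3)]
print(icm_map(unary, edges, potts_weight=2.0))  # [0 0 0 0]
```

Move-making algorithms such as alpha-expansion solve the same kind of energy with much stronger optimality guarantees; ICM is used here only to keep the sketch self-contained.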
“…It also provides relations between objects, in the sense that the scheme does not just attach labels to object, but also determines whether there are multiple objects, and in what depth ordering they are presented relative to the viewer. The key results have been presented in [17], where the ideas have been tested on benchmark datasets.…”
Section: Semantic Video Segmentation (mentioning, confidence: 99%)
“…The following references describe work that has been conducted during this project and acknowledge support by ARO: Year 1: [14,3,9,15,1,5,8,12], Year 2: [2,4,10,6,11,19], Year 3: [7,17,18].…”
Section: Publications (mentioning, confidence: 99%)
“…While complete transitions have not been accomplished during the period of performance, since the research focused on fundamental issues underlying the theoretical development of a theory of visual information, the research milestones accomplished enabled improvement of specific tasks that we envision will result in transitions in the near to mid-term future. These include visual-inertial sensor fusion [18], and semantic video segmentation [17].…”
This project pursued the development of representations of visual data suitable for control and decision tasks. The fundamental premise is that traditional notions of information developed in support of communications engineering, where the task is reproduction of the source data and nuisance factors can be easily characterized statistically, are unsuited to visual inference, where the task is decision or control and the data formation process includes scaling (which makes the continuous limit relevant) and occlusion (which makes control relevant). Specifically, the task (or class of tasks) informs what portion of the data is "informative" and what is "nuisance."