The goal of the semantic object correspondence problem is to compute dense association maps for a pair of images such that the same object parts are matched across object instances with very different appearances. Our method builds on the recent finding that deep convolutional neural networks (DCNNs) implicitly learn a latent model of object parts even when trained only for classification. We also leverage a key insight from the correspondence problem: the geometric structure between object parts is consistent across multiple object instances. These two concepts are combined in a novel optimization scheme that learns a feature embedding by rewarding projections that place features with low feature-space distance closer together on the manifold, while simultaneously penalizing feature clusters whose geometric structure is inconsistent with the observed geometric structure of object parts. In this manner, by jointly accounting for feature-space similarity and feature neighborhood context, a manifold is learned on which features belonging to semantically similar object parts cluster together. We also describe how these embedded features transfer to the related tasks of semantic keypoint classification and localization via a Siamese DCNN. We provide qualitative results on Pascal VOC 2012 images and quantitative results on the Pascal Berkeley dataset, where we improve on the state of the art by over 5% on the classification task and over 9% on the localization task.
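To make the combined objective concrete, below is a minimal NumPy sketch of one plausible formulation: a term that pulls features with high feature-space affinity together after projection, plus a term penalizing disagreement between embedded cluster layout and observed part geometry. All names (embedding_objective, part_dists, the Gaussian affinity, the weight lam) are illustrative assumptions, not the paper's actual formulation.

```python
import numpy as np

def pairwise_sq_dists(X):
    # Squared Euclidean distances between all row pairs of X (n x d).
    sq = np.sum(X**2, axis=1)
    return sq[:, None] + sq[None, :] - 2.0 * X @ X.T

def embedding_objective(W, feats, part_dists, sigma=1.0, lam=0.5):
    """Score a linear embedding W (d x k); lower is better.

    feats      : n x d DCNN features sampled at object-part locations.
    part_dists : n x n observed geometric distances between the parts
                 the features came from (the 'neighborhood context').
    """
    Z = feats @ W                                    # project onto the manifold
    d_emb = pairwise_sq_dists(Z)
    # Affinity term: features with low feature-space distance are
    # rewarded for landing close together after projection.
    affinity = np.exp(-pairwise_sq_dists(feats) / (2.0 * sigma**2))
    pull = np.sum(affinity * d_emb)
    # Geometric term: penalize embeddings whose cluster structure is
    # inconsistent with the observed part geometry.
    inconsistency = np.sum((d_emb - part_dists)**2)
    return pull + lam * inconsistency
```

In this sketch the embedding is linear and the objective would be minimized with any off-the-shelf optimizer; the paper's optimization scheme and embedding family may differ.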
Analysis of activities in low-resolution or far-field videos is a research challenge that has received little attention. In this application scenario, the motion of objects in the scene is often the only low-level information available, since features such as shape or color are unreliable. Moreover, typical videos contain interactions of multiple objects, which poses a major vision challenge. This paper proposes a method to classify activities of multiple interacting objects in low-resolution video by modeling them through a set of novel discriminative features that rely only on the object tracks. The noisy tracks of multiple objects are transformed into a feature space that encapsulates the individual characteristics of the tracks as well as their interactions. Based on this feature vector, we propose an energy minimization approach that optimally divides the object tracks and their relative distances into meaningful partitions, called "strings of motion-words". Distances between activities can then be computed by comparing two strings, and complex activities can be broken into strings that are compared separately for each object or for their interactions. We test the efficacy of our approach by searching for all instances of a given query in multiple real-life video datasets.
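As an illustration of the string comparison step, here is a minimal Python sketch. The pre-learned codebook of motion-words and the nearest-codeword quantization stand in for the paper's energy-minimization partitioning, and all function names are hypothetical; edit distance is one natural choice of string distance, not necessarily the one used in the paper.

```python
import numpy as np

def to_motion_word_string(track_feats, codebook):
    # track_feats: T x d per-frame features of a track (or of a pair of
    # tracks, via their relative distances); codebook: K x d motion-word
    # centers. Returns the sequence of nearest motion-word indices.
    dists = np.linalg.norm(track_feats[:, None, :] - codebook[None, :, :], axis=2)
    return dists.argmin(axis=1).tolist()

def edit_distance(a, b):
    # Levenshtein distance between two motion-word strings, usable as
    # the activity distance when ranking query matches.
    dp = np.arange(len(b) + 1)
    for i, x in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, y in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (x != y))
    return int(dp[-1])
```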
We propose a human action recognition algorithm that captures a compact signature of shape dynamics from multi-view videos. First, we compute the R transform and its temporal velocity on action silhouettes from multiple views to generate a robust low-level representation of shape. The spatio-temporal shape dynamics across all views are then captured by fusing eigen and multiset partial least squares modes. This yields a lightweight signature that is classified using a probabilistic subspace similarity technique by learning inter-action and intra-action models. Quantitative and qualitative results of our algorithm are reported on MuHAVi, a publicly available multi-camera, multi-action dataset.
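The following is a minimal sketch of the first stage only: per-frame R transforms of the silhouettes and their temporal velocity. It assumes scikit-image's radon function and a simple scale normalization; the eigen/PLS fusion and the probabilistic subspace classification stages are omitted.

```python
import numpy as np
from skimage.transform import radon

def r_transform(silhouette, angles=np.arange(180.0)):
    # R(theta) = sum over rho of Radon(rho, theta)^2, normalized so the
    # descriptor is insensitive to silhouette scale.
    sinogram = radon(silhouette.astype(float), theta=angles, circle=False)
    r = np.sum(sinogram**2, axis=0)
    return r / (r.max() + 1e-12)

def shape_dynamics(silhouette_seq):
    # Stack per-frame R transforms and append their temporal derivative
    # (velocity) to form the low-level shape representation.
    R = np.stack([r_transform(s) for s in silhouette_seq])   # T x 180
    V = np.diff(R, axis=0, prepend=R[:1])                    # temporal velocity
    return np.concatenate([R, V], axis=1)                    # T x 360
```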