This work addresses the recognition of human activities in videos at distinct semantic levels, including individual actions, interactions, and group activities. Recognition is realized using a two-level hierarchy of Long Short-Term Memory (LSTM) networks, forming a feed-forward deep architecture that can be trained end-to-end. In comparison with existing LSTM architectures, we make two key contributions, which give our approach its name: Confidence-Energy Recurrent Network (CERN). First, instead of using the common softmax layer for prediction, we specify a novel energy layer (EL) for estimating the energy of our predictions. Second, rather than finding the common minimum-energy class assignment, which may be numerically unstable under uncertainty, we specify that the EL additionally computes the p-values of the solutions, and in this way estimates the most confident energy minimum. Evaluation on the Collective Activity and Volleyball datasets demonstrates (i) the advantages of our two contributions relative to the common softmax and energy-minimization formulations, and (ii) superior performance relative to state-of-the-art approaches.
This paper presents a method for localizing functional objects and predicting human intents and trajectories in surveillance videos of public spaces, with no supervision in training. People in public spaces are expected to intentionally take shortest paths (subject to obstacles) toward certain objects (e.g., a vending machine, picnic table, or dumpster) where they can satisfy certain needs (e.g., quench thirst). Since these objects are typically very small or heavily occluded, they cannot be inferred from their visual appearance, but only indirectly from their influence on people's trajectories. We therefore call them "dark matter", by analogy to cosmology, since their presence can only be observed as attractive or repulsive "fields" in the public space. A person in the scene is modeled as an intelligent agent engaged in one of the "fields", selected depending on his/her intent. An agent's trajectory is derived from Agent-based Lagrangian Mechanics. Agents can change their intents in the middle of motion and thus alter their trajectories. For evaluation, we compiled and annotated a new dataset. The results demonstrate the effectiveness of our approach in predicting human intents and trajectories, and in localizing and discovering distinct types of "dark matter" in wide public spaces.
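The assumption that people take shortest paths (subject to obstacles) toward a goal object can be illustrated with a minimal sketch. This is not the Agent-based Lagrangian Mechanics described in the abstract; it swaps in a plain breadth-first search on a 4-connected occupancy grid. The grid encoding (1 marks an obstacle), the function name `shortest_path`, and the example coordinates are all hypothetical, chosen only for illustration.

```python
from collections import deque


def shortest_path(grid, start, goal):
    """Return a shortest obstacle-free path from start to goal, or None.

    Assumption: grid is a list of rows, where 1 marks an obstacle and 0 a
    free cell; cells are (row, col) tuples; moves are 4-connected.
    """
    rows, cols = len(grid), len(grid[0])
    parent = {start: None}  # also serves as the visited set
    queue = deque([start])
    while queue:
        cell = queue.popleft()
        if cell == goal:
            # Walk the parent links back to the start, then reverse.
            path = []
            while cell is not None:
                path.append(cell)
                cell = parent[cell]
            return path[::-1]
        r, c = cell
        for nxt in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            nr, nc = nxt
            inside = 0 <= nr < rows and 0 <= nc < cols
            if inside and grid[nr][nc] == 0 and nxt not in parent:
                parent[nxt] = cell
                queue.append(nxt)
    return None  # goal is unreachable


# A 3x3 scene with one obstacle in the middle.
scene = [
    [0, 0, 0],
    [0, 1, 0],
    [0, 0, 0],
]
path = shortest_path(scene, (0, 0), (2, 2))

# An agent that changes intent mid-motion simply replans from its
# current cell toward the new goal.
replanned = shortest_path(scene, path[2], (2, 0))
```

In this toy setting, a change of intent amounts to calling `shortest_path` again from the agent's current cell with a different goal, which is one simple way to picture how a trajectory can be altered in the middle of motion.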