2019 IEEE/CVF International Conference on Computer Vision Workshop (ICCVW)
DOI: 10.1109/iccvw.2019.00288

Spatio-Temporal Action Graph Networks

Abstract: Events defined by the interaction of objects in a scene are often of critical importance; yet important events may have insufficient labeled examples to train a conventional deep model to generalize to future object appearance. Activity recognition models that represent object interactions explicitly have the potential to learn in a more efficient manner than those that represent scenes with global descriptors. We propose a novel inter-object graph representation for activity recognition based on a disentangle…
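
As an illustration of the kind of model the abstract describes (explicit per-frame object interaction graphs aggregated over time), here is a minimal PyTorch sketch. The layer choices, dimensions, attention form, and readout are assumptions made for exposition only, not the authors' implementation.

```python
# Illustrative sketch only: a minimal spatio-temporal object-graph classifier,
# loosely following the idea in the abstract (explicit inter-object relations
# per frame, aggregated over time). All design choices here are assumptions.
import torch
import torch.nn as nn


class SpatioTemporalActionGraph(nn.Module):
    def __init__(self, obj_dim=256, hidden=128, num_classes=10):
        super().__init__()
        # Edge MLP: scores how strongly object i attends to object j in a frame.
        self.edge_mlp = nn.Sequential(
            nn.Linear(2 * obj_dim, hidden), nn.ReLU(), nn.Linear(hidden, 1)
        )
        self.node_proj = nn.Linear(obj_dim, hidden)
        # Temporal aggregation over per-frame graph summaries.
        self.temporal = nn.GRU(hidden, hidden, batch_first=True)
        self.classifier = nn.Linear(hidden, num_classes)

    def forward(self, obj_feats):
        # obj_feats: (B, T, N, D) object features per frame (e.g. RoI-pooled).
        B, T, N, D = obj_feats.shape
        x = obj_feats.reshape(B * T, N, D)
        # Pairwise concatenation -> soft adjacency (B*T, N, N).
        pairs = torch.cat(
            [x.unsqueeze(2).expand(-1, N, N, -1),
             x.unsqueeze(1).expand(-1, N, N, -1)], dim=-1)
        adj = self.edge_mlp(pairs).squeeze(-1).softmax(dim=-1)
        # Message passing: each node receives a weighted sum of its neighbours.
        nodes = adj @ self.node_proj(x)                    # (B*T, N, hidden)
        frame_repr = nodes.mean(dim=1).reshape(B, T, -1)   # graph readout per frame
        _, h = self.temporal(frame_repr)                   # aggregate over time
        return self.classifier(h[-1])                      # activity logits


# Usage: 2 clips, 8 frames, 5 detected objects with 256-d appearance features.
logits = SpatioTemporalActionGraph()(torch.randn(2, 8, 5, 256))
print(logits.shape)  # torch.Size([2, 10])
```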

Cited by 71 publications (47 citation statements). References 51 publications.
“…Attention for action recognition: There has been a large body of work on incorporating attention in neural networks, primarily focused on language-related tasks [44,51]. Attention for videos has been pursued in various forms, including gating or second-order pooling [12,30,31,49], guided by human pose or other primitives [4,5,12,13], region-graph representations [19,48], recurrent models [37] and self-attention [47]. Our model can be thought of as a form of self-attention complementary to these approaches.…”
Section: Related Work
confidence: 99%
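
The excerpt above frames the cited model as a form of self-attention over video features. Below is a minimal sketch of generic scaled dot-product self-attention applied to a set of region (or frame) features, purely to illustrate the mechanism; it is not the architecture of any specific cited work.

```python
# Generic self-attention over region features; illustrative only.
import torch
import torch.nn as nn


class RegionSelfAttention(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)
        self.scale = dim ** -0.5

    def forward(self, regions):
        # regions: (B, R, dim) feature vectors for R regions (or frames).
        q, k, v = self.q(regions), self.k(regions), self.v(regions)
        attn = (q @ k.transpose(-2, -1) * self.scale).softmax(dim=-1)
        return attn @ v + regions  # attended features with a residual connection


out = RegionSelfAttention()(torch.randn(2, 16, 256))
print(out.shape)  # torch.Size([2, 16, 256])
```
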
“…Other approaches use physiological signals from the driver [22]. In recent years, deep-learning computer vision techniques have been applied to anomaly detection in first-person driving videos [14,15,23,24]. Other works further attempt to classify the type of anomaly occurring in the video, either offline after the video is fully observed [25][26][27] or in real time [16].…”
Section: Traffic Video Anomaly Detection and Classification
confidence: 99%
“…STAG [24] (anomaly detection, supervised): uses a spatio-temporal action graph (STAG) network to model the spatial and temporal relations among objects.…”
Section: DSA-RNN [23]
confidence: 99%
“…The core idea is to enable communication between image regions to build contextualized representations of these regions. Graph networks have been successfully applied to various tasks, from object detection [25] and region classification [7] to human-object interaction [30] and activity recognition [12]. Besides, self-attention models [35] and non-local networks [38] can also be cast as graph networks in a general sense.…”
Section: Graph Network and Contextualized Representations
confidence: 99%
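
The last excerpt describes graph networks as letting image regions communicate to build contextualized representations. A hedged sketch of one such GCN-style message-passing step over region features follows; the adjacency construction and the single-layer update are illustrative assumptions, not taken from any cited paper.

```python
# One graph-convolution step over image-region features; illustrative only.
import torch


def region_gcn_step(feats, adj, weight):
    """feats: (R, D) region features; adj: (R, R) binary adjacency; weight: (D, D)."""
    adj = adj + torch.eye(adj.size(0))             # add self-loops
    deg_inv = adj.sum(dim=1, keepdim=True).reciprocal()
    messages = deg_inv * (adj @ feats)             # mean over each region's neighbours
    return torch.relu(messages @ weight)           # contextualized region features


R, D = 6, 128
feats = torch.randn(R, D)
adj = (torch.rand(R, R) > 0.5).float()             # stand-in for e.g. IoU-based edges
adj = ((adj + adj.t()) > 0).float()                # make the graph symmetric
out = region_gcn_step(feats, adj, torch.randn(D, D) * 0.05)
print(out.shape)  # torch.Size([6, 128])
```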