Unified Embedding and Metric Learning for Zero-Exemplar Event Detection

Hussein, Noureldien; Gavves, Efstratios; Smeulders, A.W.M.

doi:10.1109/cvpr.2017.225

Cited by 11 publications

(9 citation statements)

References 35 publications

Supporting

Mentioning

Contrasting

Order By: Relevance

“…Recently, there is a major interest in understanding long-range activities, which brings news challenges. The reason is that these activities are com-plex [2], take longer to unfold [1] and are harder to model their temporal structure [3,20]. New benchmarks are proposed, as Charades [12], Epic-Kitchens [21], Breakfast [1], MultiThumos [13,22], YouCook [23] or Tasty [24].…”

Section: Related Workmentioning

confidence: 99%

PIC: Permutation Invariant Convolution for Recognizing Long-range Activities

Hussein¹,

Gavves²,

Smeulders³

2020

Preprint

Self Cite

View full text Add to dashboard Cite

Neural operations as convolutions, self-attention, and vector aggregation are the go-to choices for recognizing short-range actions. However, they have three limitations in modeling long-range activities. This paper presents PIC, Permutation Invariant Convolution, a novel neural layer to model the temporal structure of long-range activities. It has three desirable properties. i. Unlike standard convolution, PIC is invariant to the temporal permutations of features within its receptive field, qualifying it to model the weak temporal structures. ii. Different from vector aggregation, PIC respects local connectivity, enabling it to learn longrange temporal abstractions using cascaded layers. iii. In contrast to self-attention, PIC uses shared weights, making it more capable of detecting the most discriminant visual evidence across long and noisy videos. We study the three properties of PIC and demonstrate its effectiveness in recognizing the long-range activities of Charades, Breakfast, and MultiThumos.

show abstract

Section: Related Workmentioning

confidence: 99%

PIC: Permutation Invariant Convolution for Recognizing Long-range Activities

Hussein¹,

Gavves²,

Smeulders³

2020

Preprint

Self Cite

View full text Add to dashboard Cite

show abstract

“…Often, a pooling layer is very common towards the end of deep CNNs. There exist various pooling mechanisms in literature like Average or Max Pooling [26], [27], Attention Pooling [28], Rank-Pooling [29] and High-Dimensional Feature encoding [30]. The goal of pooling is to select the most important features and reduce the network size so that the model doesn't over-fit.…”

Section: Activity Recognition -Learn-able Poolingsmentioning

confidence: 99%

Attentional Learn-able Pooling for Human Activity Recognition

Debnath

O’Brien

Kumar

et al. 2021

2021 IEEE International Conference on Robotics and Automation (ICRA)

View full text Add to dashboard Cite

Human activity/behaviour monitoring and recognition is a key for facilitating humans robot interaction, and allows robots for a better scheduling of future operations. It is challenging and often addressed at different levels, such as human activity classification, future activity prediction and monitoring of the on-going activities. The paper proposes a novel attention-based learn-able pooling mechanism for human activity classification from RGB videos. Recently, most of the best performing human activity recognition approaches are based on 3D skeleton positions. The 3D skeleton positions are not always available in videos captured using RGB cameras, which are widely used in robotics applications. RGB videos contain rich spatio-temporal information and processing them semantically is a difficult task. Moreover, accurately capturing spatial information and long-term temporal dependencies is the key to achieving high recognition accuracy. We use an existing Convolutional Neural Network for image recognition to extract video features which are then processed using our innovative application of attention mechanism to focus the network on features that are more important for discrimination. Afterwards, we use a novel learn-able pooling mechanism to extract activity-aware spatio-temporal cues for efficient activity recognition. The proposed pooling mechanism learns the structural information from hidden states of a bidirectional Long Short-Term Memory network via Fisher Vectors.

show abstract

“…Traditional Human Activity Recognition: Recent surge of deep learning has significantly influenced the advancement in recognizing human activities from videos. Most attempts in this genre are usually derived from the imagebased networks, which are used to extract features from individual frames and extended them to perform temporal integration by forming a fixed size descriptor using statistical pooling such as max and average pooling [16,13], attentional pooling [11], rank pooling [9], context gating [33] and high-dimensional feature encoding [12,55]. However, an important visual cue representing the temporal pattern is overlooked in such statistical pooling and highdimensional encoding.…”

Section: Related Work and Motivationmentioning

confidence: 99%

Coarse Temporal Attention Network (CTA-Net) for Driver's Activity Recognition

Wharton¹,

Behera²,

Liu³

et al. 2021

Preprint

View full text Add to dashboard Cite

There is significant progress in recognizing traditional human activities from videos focusing on highly distinctive actions involving discriminative body movements, bodyobject and/or human-human interactions. Driver's activities are different since they are executed by the same subject with similar body parts movements, resulting in subtle changes. To address this, we propose a novel framework by exploiting the spatiotemporal attention to model the subtle changes. Our model is named Coarse Temporal Attention Network (CTA-Net), in which coarse temporal branches are introduced in a trainable glimpse network. The goal is to allow the glimpse to capture high-level temporal relationships, such as 'during', 'before' and 'after' by focusing on a specific part of a video. These branches also respect the topology of the temporal dynamics in the video, ensuring that different branches learn meaningful spatial and temporal changes. The model then uses an innovative attention mechanism to generate high-level action specific contextual information for activity recognition by exploring the hidden states of an LSTM. The attention mechanism helps in learning to decide the importance of each hidden state for the recognition task by weighing them when constructing the representation of the video. Our approach is evaluated on four publicly accessible datasets and significantly outperforms the state-of-the-art by a considerable margin with only RGB video as input.

show abstract

Unified Embedding and Metric Learning for Zero-Exemplar Event Detection

Cited by 11 publications

References 35 publications

PIC: Permutation Invariant Convolution for Recognizing Long-range Activities

PIC: Permutation Invariant Convolution for Recognizing Long-range Activities

Attentional Learn-able Pooling for Human Activity Recognition

Coarse Temporal Attention Network (CTA-Net) for Driver's Activity Recognition

Contact Info

Product

Resources

About