2019
DOI: 10.1007/978-3-030-11018-5_21

Learnable Pooling Methods for Video Classification

Abstract: We introduce modifications to state-of-the-art approaches to aggregating local video descriptors by using attention mechanisms and function approximations. Rather than using ensembles of existing architectures, we provide insight into creating new architectures. We demonstrate our solutions in "The 2nd YouTube-8M Video Understanding Challenge" by using frame-level video and audio descriptors. We obtain testing accuracy similar to the state of the art, while meeting budget constraints, and touch upon stra…

Cited by 88 publications (173 citation statements)
References 55 publications
“…In practice, d_v = 4,096, d_c = 4,096 and d = 4,096 resulting in a model composed of 67M parameters. Note that the first term on the right-hand side in Equations (2) and (3) is a linear fully connected layer and the second term corresponds to a context gating function [31] with an output ranging between 0 and 1, which role is to modulate the output of the linear layer. As a result, this embedding function can model nonlinear multiplicative interactions between the dimensions of the input feature vector which has proven effective in other text-video embedding applications [32].…”
Section: Text-video Joint Embedding Model (mentioning)
confidence: 99%
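
The quoted statement can be illustrated with a minimal sketch in plain Python. It assumes each embedding function has the form (W1 x + b1) multiplied element-wise by sigmoid(W2 (W1 x + b1) + b2); the quote only says that a linear layer is modulated by a gate with output between 0 and 1, so the gate input, the names `gated_embedding` and `param_count`, and the inclusion of bias terms are assumptions made for illustration. Under those assumptions, two such functions (one for video, one for captions) with all dimensions equal to 4,096 come to roughly 67M parameters, consistent with the figure in the quote.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def linear(W, b, x):
    """Return W @ x + b, with W given as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]

def gated_embedding(x, W1, b1, W2, b2):
    """Assumed form: linear layer scaled element-wise by a sigmoid gate in (0, 1)."""
    h = linear(W1, b1, x)                            # first term: linear layer
    gate = [sigmoid(g) for g in linear(W2, b2, h)]   # second term: gating function
    return [hi * gi for hi, gi in zip(h, gate)]

def param_count(d_in, d_out):
    """Parameters of one gated embedding under the assumed form: W1, b1, W2, b2."""
    return d_in * d_out + d_out + d_out * d_out + d_out

# Dimensions taken from the quote; one embedding for video, one for captions.
d_v = d_c = d = 4096
total = param_count(d_v, d) + param_count(d_c, d)

# Toy check: identity linear layer and zero gate weights give a gate of 0.5.
toy = gated_embedding([2.0, -4.0],
                      [[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0],
                      [[0.0, 0.0], [0.0, 0.0]], [0.0, 0.0])
```

Here `total` is 67,125,248, and `toy` is the input halved, since every gate value is sigmoid(0) = 0.5.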
“…This method is evaluated on higher-level activities, showing that such a visual embedding aligns well with the learned space of Word2Vec to perform zero-shot recognition of these coarser-grained classes. Miech et al [21] found that using NetVLAD [3] results in an increase in accuracy over GRUs or LSTMs for aggregation of both visual and text features. A follow up on this work [22] learns a mixture of experts embedding from multiple modalities such as appearance, motion, audio or face features.…”
Section: Related Work (mentioning)
confidence: 99%
“…The CCG is based on a scene traits that when a specific object in an image is found, the scene is very likely to belong to a particular class associated with the object. The CCG is inspired by context gating [31] and the CCM [9]. The concept of CCG is depicted in Fig.…”
Section: Fusion Of Object Feature and Scene Feature (mentioning)
confidence: 99%
“…where ⊙ denotes element-wise multiplication; W and b are the trainable parameters; x_{object→scene} is a pseudo scene feature obtained by converting the object feature into the scene feature through CCM, and σ(x) = 1 / (1 + exp(−x)) is a sigmoid function. The structure of CCG is motivated by context gating [31]. The context gating transforms the input feature into a new feature using a self-gating mechanism, and it demonstrated significant improvements in video understanding tasks.…”
Section: Fusion Of Object Feature and Scene Feature (mentioning)
confidence: 99%
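
The self-gating mechanism mentioned in the statement above can be sketched as follows. This assumes the commonly cited context gating form y = σ(W x + b) ⊙ x with a square weight matrix W; it is not the CCG equation itself, which is not reproduced in the quote, and the function name `context_gating` is hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def context_gating(x, W, b):
    """Self-gating sketch: y = sigmoid(W x + b) * x, applied element-wise."""
    pre = [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]
    return [sigmoid(p) * xi for p, xi in zip(pre, x)]

x = [1.0, -2.0, 3.0]
W = [[0.0, 0.0, 0.0] for _ in range(3)]  # W must be d x d so the gate matches x

# Zero weights and zero bias: every gate value is sigmoid(0) = 0.5.
y_half = context_gating(x, W, [0.0, 0.0, 0.0])

# A large positive bias drives the gate towards 1, so the input passes through.
y_open = context_gating(x, W, [50.0, 50.0, 50.0])
```

Because each gate value lies strictly between 0 and 1, every output component has the same sign as the corresponding input and a magnitude no larger than it; the learned W and b decide which dimensions are attenuated.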