Proceedings of the 28th ACM International Conference on Multimedia 2020
DOI: 10.1145/3394171.3413756

Cross-modal Non-linear Guided Attention and Temporal Coherence in Multi-modal Deep Video Models

Abstract: Videos contain data in multiple modalities, e.g., audio, video frames, and text (captions). Understanding and modeling the interaction between different modalities is key for video analysis tasks such as categorization, object detection, and activity recognition. However, data modalities are not always correlated, so learning when modalities are correlated and using that to guide the influence of one modality on the other is crucial. Another salient feature of videos is the coherence between successive frames due to continuit…
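
The abstract is truncated above, but the core idea it names — non-linear guided attention that modulates one modality's influence on another based on a learned correlation — can be illustrated with a minimal sketch. The module name, feature dimensions, and gating design below are assumptions for illustration, not the paper's actual architecture:

```python
# Hypothetical sketch: gating audio's influence on video features by a
# learned non-linear correlation score. Names and shapes are assumptions,
# not the paper's implementation.
import torch
import torch.nn as nn

class CrossModalGate(nn.Module):
    def __init__(self, video_dim: int, audio_dim: int, hidden_dim: int = 256):
        super().__init__()
        # Non-linear scorer: how correlated are the two modalities right now?
        self.scorer = nn.Sequential(
            nn.Linear(video_dim + audio_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1),
            nn.Sigmoid(),  # gate in [0, 1]
        )
        self.audio_proj = nn.Linear(audio_dim, video_dim)

    def forward(self, video: torch.Tensor, audio: torch.Tensor) -> torch.Tensor:
        # video: (batch, frames, video_dim); audio: (batch, frames, audio_dim)
        gate = self.scorer(torch.cat([video, audio], dim=-1))  # (batch, frames, 1)
        # Audio influences video only to the degree the gate deems them correlated.
        return video + gate * self.audio_proj(audio)

# Usage: fuse per-frame Inception-v3-style video features (1024-d) with
# VGGish-style audio features (128-d) for a 2-video, 300-frame batch.
fuse = CrossModalGate(video_dim=1024, audio_dim=128)
fused = fuse(torch.randn(2, 300, 1024), torch.randn(2, 300, 128))
```

When the gate saturates near zero (uncorrelated modalities), the video stream passes through unchanged, which is one plausible way to realize "guiding the influence of one modality on the other".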

Cited by 4 publications (4 citation statements, published in 2021 and 2022) · References 26 publications

Citation statements (ordered by relevance):
“…Zero-shot learning with knowledge graphs: Zero-shot learning has been widely studied in computer vision (Akata et al., 2015; Lampert et al., 2013; Sahu et al., 2020; Xian et al., 2018). We will focus on related work relevant to our approach.…”
Section: Related Work
confidence: 99%
“…We use the YouTube-8M dataset for our experiments, which consists of frame-wise video and audio features for approximately 5 million videos, extracted using Inception v3 and VGGish respectively, followed by PCA [1]. We use the hierarchical label space with 431 classes (see [17]). We use binary cross-entropy loss to train our models.…”
Section: Experimental Setup
confidence: 99%
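
The setup this citing paper describes — precomputed Inception v3 video and VGGish audio features, a 431-class label space, and binary cross-entropy loss — is standard multi-label training. A hedged sketch, where the mean-pooling classifier and layer sizes are assumptions rather than the cited paper's model:

```python
# Hedged sketch of the cited training setup: multi-label classification over
# 431 classes with binary cross-entropy on precomputed per-frame features.
import torch
import torch.nn as nn

NUM_CLASSES = 431  # hierarchical label space cited above

model = nn.Sequential(
    nn.Linear(1024 + 128, 512),   # concatenated video (1024-d) + audio (128-d) features
    nn.ReLU(),
    nn.Linear(512, NUM_CLASSES),  # raw logits; sigmoid is folded into the loss
)
criterion = nn.BCEWithLogitsLoss()  # binary cross-entropy, one sigmoid per class
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# One training step on a toy batch: mean-pool frame features, then classify.
video = torch.randn(8, 300, 1024)   # (batch, frames, video_dim)
audio = torch.randn(8, 300, 128)    # (batch, frames, audio_dim)
labels = torch.randint(0, 2, (8, NUM_CLASSES)).float()  # multi-hot targets

features = torch.cat([video, audio], dim=-1).mean(dim=1)  # (batch, 1152)
optimizer.zero_grad()
loss = criterion(model(features), labels)
loss.backward()
optimizer.step()
```

Binary cross-entropy (rather than softmax cross-entropy) fits here because a video can carry several of the 431 labels at once.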
“…Existing works on video classification using deep learning can be broadly divided into four categories: (i) convolutional neural networks (CNNs) [21, 12, 4, 20], (ii) recurrent neural networks (RNNs) [28, 27, 26], (iii) graph-based methods [13, 2], and (iv) attention-based models [17, 10, 7].…”
Section: Introduction
confidence: 99%