Proceedings of the 24th ACM International Conference on Multimedia 2016
DOI: 10.1145/2964284.2984066
Multimodal Video Description

Cited by 135 publications (67 citation statements)
References 13 publications
“…Several works have employed multimodal signals to caption the MSR-VTT dataset (Xu et al., 2016), which consists of 2K video clips from 20 general categories (e.g., "news", "sports") with an average duration of 10 seconds per clip. In particular, Ramanishka et al. (2016) […] However, we suspect that the instructional video domain is significantly different from MSR-VTT (where the audio information does not necessarily correspond to human speech), as we find that ASR-only models significantly surpass the state-of-the-art video model in our case. Palaskar et al. (2019) and Shi et al. (2019), contemporaneous with the submission of the present work, also examine ASR as a source of signal for generating how-to video captions.…”
Section: Related Work (mentioning)
confidence: 61%
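
For context on the ASR-only setup this excerpt describes, below is a minimal sketch of a captioner that conditions only on a clip's speech transcript and ignores visual features entirely. It is an illustration under assumed dimensions, vocabulary size, and module choices, not the architecture of Palaskar et al. (2019), Shi et al. (2019), or the citing work.

import torch
import torch.nn as nn

class ASROnlyCaptioner(nn.Module):
    # A text-to-text baseline: the caption is generated from the ASR transcript
    # alone, with no visual features at all.
    def __init__(self, vocab_size=10000, embed_dim=256, hidden_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.encoder = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.decoder = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, asr_tokens, caption_tokens):
        # Encode the clip's speech transcript; the final LSTM state summarizes the speech.
        _, state = self.encoder(self.embed(asr_tokens))
        # Decode the caption conditioned on that state (teacher forcing at train time).
        dec_out, _ = self.decoder(self.embed(caption_tokens), state)
        return self.out(dec_out)  # (batch, caption_len, vocab_size) logits

# Toy usage with random token ids; real inputs would come from an ASR system and a tokenizer.
model = ASROnlyCaptioner()
asr = torch.randint(0, 10000, (2, 40))   # 2 clips, 40 transcript tokens each
cap = torch.randint(0, 10000, (2, 12))   # 12 caption tokens (shifted targets)
print(model(asr, cap).shape)             # torch.Size([2, 12, 10000])
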
“…More recently, different features can help characterize the video's semantic meaning from different perspectives. Many existing works utilize motion information [42], temporal information [4,18,31], and even audio information [51] to yield competitive performance. However, the diverse features in these works are simply concatenated with each other, which ignores the relationships among them.…”
Section: Video Captioning (mentioning)
confidence: 99%
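
The "simple concatenation" fusion criticized in this excerpt amounts to stacking per-modality feature vectors and projecting them jointly. The sketch below illustrates that scheme only; the feature dimensions, pooling, and projection size are assumptions, not taken from any of the cited methods.

import torch
import torch.nn as nn

class ConcatFusion(nn.Module):
    # Each modality's clip-level feature vector is concatenated and passed through a
    # single projection; no cross-modal relationships are modeled.
    def __init__(self, appearance_dim=2048, motion_dim=1024, audio_dim=128, fused_dim=512):
        super().__init__()
        self.project = nn.Linear(appearance_dim + motion_dim + audio_dim, fused_dim)

    def forward(self, appearance, motion, audio):
        fused = torch.cat([appearance, motion, audio], dim=-1)  # (batch, 2048+1024+128)
        return torch.relu(self.project(fused))                  # (batch, fused_dim)

# Toy usage with mean-pooled per-clip features; the result would feed a caption decoder.
fusion = ConcatFusion()
out = fusion(torch.randn(4, 2048), torch.randn(4, 1024), torch.randn(4, 128))
print(out.shape)  # torch.Size([4, 512])
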
“…In this subsection, we compare our method with state-of-the-art methods that use multiple features on benchmark datasets, including SA [53], M3 [47], v2t navigator [18], Aalto [36], VideoLab [31], MA-LSTM [51], M&M-TGM [4], PickNet [8], LSTM-TSA IV [28], SibNet [23], MGSA [5], and SCN-LSTM [14], most of which fuse different features by simple concatenation.…”
Section: Performance Comparisons (mentioning)
confidence: 99%
“…This indicates that it is beneficial to train our model using step-by-step learning. For MSR-VTT, we also compare our models with the top-3 results from the MSR-VTT challenge in Table 1, including v2t-navigator (Jin et al. 2016), Aalto (Shetty and Laaksonen 2016), and VideoLAB (Ramanishka et al. 2016), which are all based on features from multiple cues such as action features and audio features. The experimental results presented in Table 1 show that our TDAM performs significantly better than other methods on all metrics.…”
Section: Comparison With the State-of-the-art (mentioning)
confidence: 99%