2020
DOI: 10.48550/arxiv.2002.00163
Preprint
Bridging Text and Video: A Universal Multimodal Transformer for Video-Audio Scene-Aware Dialog

Cited by 9 publications (8 citation statements)
References 14 publications
“…In addition to our full model, we also show results without the object-level features, referred to as 'CoMVT (Scene feats only)' in Table 1, as this is more similar to previous multimodal models [25,42,48,77]. We also show the effects of omitting BERT pretraining for the text stream and the MLM loss. For each model, including the baselines, we perform a grid search on learning rates and report the test performance of the models that perform best on the validation set.…”
Section: Baselines (supporting)
confidence: 69%
“…Related to our work is the task of scene-aware dialog prediction [2,33], where the goal is to answer questions grounded in an input video clip, given a manually created dialog history. A number of works show promising results on this task [14,32,42,45,48]. In contrast, we predict future utterances from videos using the natural speech present in the video itself.…”
Section: Related Work (mentioning)
confidence: 99%