2021
DOI: 10.1109/taslp.2021.3065823

Bridging Text and Video: A Universal Multimodal Transformer for Audio-Visual Scene-Aware Dialog

Abstract: Audio-Visual Scene-Aware Dialog (AVSD) is a task to generate responses when chatting about a given video, organized as a track of the 8th Dialog System Technology Challenge (DSTC8). To solve the task, we propose a universal multimodal transformer and introduce a multitask learning method to learn joint representations across modalities as well as to generate informative and fluent responses. Our method extends the natural language generation pre-trained model to multimodal dialogue generation…
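The multitask learning described in the abstract (joint representation learning alongside response generation) is commonly realized as a weighted sum of a token-level generation loss and an auxiliary objective such as video-text matching. The sketch below is illustrative only: the specific losses, the matching head, and the weighting `lam` are assumptions, not the paper's implementation.

```python
import numpy as np

def language_model_loss(logits, target_ids):
    """Token-level cross-entropy over the response tokens.

    logits:     (T, V) unnormalized scores over a vocabulary of size V
    target_ids: (T,)   reference token ids
    """
    # numerically stable softmax over the vocabulary at each position
    exp = np.exp(logits - logits.max(axis=-1, keepdims=True))
    probs = exp / exp.sum(axis=-1, keepdims=True)
    # negative log-likelihood of the reference tokens
    nll = -np.log(probs[np.arange(len(target_ids)), target_ids])
    return nll.mean()

def matching_loss(score, label):
    """Binary cross-entropy for a video-text matching head (label 1 = paired)."""
    p = 1.0 / (1.0 + np.exp(-score))
    return -(label * np.log(p) + (1 - label) * np.log(1 - p))

def multitask_loss(logits, target_ids, match_score, match_label, lam=1.0):
    """Weighted sum of generation and matching objectives (lam is illustrative)."""
    return language_model_loss(logits, target_ids) + lam * matching_loss(match_score, match_label)
```

With uniform logits over a 5-token vocabulary, the generation term reduces to log 5, and a zero matching score against a positive label contributes log 2, so the combined loss is log 10.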

Cited by 61 publications (53 citation statements)
References 17 publications
“…Transformer-based generation models appeared in DSTC8. In particular, fine-tuned pre-trained Transformer-based language models such as GPT-2 (Radford et al. 2019) and BERT (Devlin et al. 2019) received the top two human-rated scores (Li et al. 2021; Chen et al. 2020). Therefore, we employ pre-training and fine-tuning of a Transformer-based language model for our response generation model, as it can generate fluent sentences.…”
Section: Related Studies: Network Architectures of AVSD Models
confidence: 99%
“…While the conventional method (Li et al. 2021) uses I3D features as its video feature V, we use video features extracted from the pre-trained TimeSformer model described in the following section. Input features for the response generation model are the concatenation of the TimeSformer video feature, the dialog history, and the question.…”
Section: Proposed Response Generation Model
confidence: 99%
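The input assembly this citation describes (concatenating projected video features with dialog history and question embeddings) can be sketched as below. The linear projection and per-span segment-type embeddings are common practice for such models but are assumptions here, not details taken from the cited work.

```python
import numpy as np

def build_input(video_feats, history_emb, question_emb, proj, type_emb):
    """Build the transformer input sequence [video ; history ; question].

    video_feats:  (T_v, D_v) frame-level features (e.g. from TimeSformer)
    history_emb:  (T_h, D)   token embeddings of the dialog history
    question_emb: (T_q, D)   token embeddings of the current question
    proj:         (D_v, D)   linear projection from video space to text space
    type_emb:     (3, D)     one segment-type embedding per modality span
    """
    video_emb = video_feats @ proj  # project video features into text space
    spans = [video_emb, history_emb, question_emb]
    # add a segment-type embedding so the model can tell the spans apart
    return np.concatenate(
        [span + type_emb[i] for i, span in enumerate(spans)], axis=0
    )  # shape: (T_v + T_h + T_q, D)
```

The resulting sequence is what a decoder-style language model would attend over when generating the response.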
“…Researchers have explored various techniques for efficient access to video data. They applied video summarization techniques such as visual frame reduction [2][3][4][5][6][7] and described videos in textual form. Frame reduction with deep learning approaches discards a video's useless or low-attention frames.…”
Section: Introduction
confidence: 99%