Bridging Text and Video: A Universal Multimodal Transformer for Audio-Visual Scene-Aware Dialog

Li, Zekang; Li, Zongjia; Zhang, Jinchao; Feng, Yang; Zhou, Jie

doi:10.1109/taslp.2021.3065823

Cited by 61 publications

(53 citation statements)

References 17 publications

“…Transformer-based generation models appeared in DSTC8. In particular, the pre-trained Transformer-based language model such as GPT-2 (Radford et al 2019) or BERT (Devlin et al 2019) that was fine-tuned received the top two human-rated scores (Li et al 2021; Chen et al 2020). Therefore, we employ pre-training and fine-tuning of the Transformer-based language model for our response generation model as it can generate fluent sentence.…”

Section: Related Studies Network Architectures Of AVSD Models (mentioning)

confidence: 99%

“…While the conventional method (Li et al 2021) uses I3D feature as its video feature V, we use the video feature extracted from the pre-trained TimeSformer model described in the following section. Input features for the response generation model are the concatenation of TimeSformer video feature, dialog history, and question.…”

Section: Proposed Response Generation Model (mentioning)

confidence: 99%
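
The concatenation described in the statement above can be illustrated with a minimal sketch. It assumes that the video feature, dialog history, and question have already been mapped to sequences of vectors with a common feature dimension; the excerpt does not say how that mapping is done. The function name `concat_inputs`, the segment ids, and the toy sizes are hypothetical and are not taken from the cited papers.

```python
from typing import List, Tuple

Vector = List[float]


def concat_inputs(
    video: List[Vector],
    history: List[Vector],
    question: List[Vector],
) -> Tuple[List[Vector], List[int]]:
    """Join three feature sequences along the sequence axis.

    Returns the joined sequence plus a parallel list of segment ids
    (0 = video, 1 = dialog history, 2 = question). The segment ids are
    an illustrative assumption, not something the excerpt specifies.
    """
    segments = [video, history, question]

    # All vectors must share one feature dimension to be concatenated.
    dims = {len(vec) for seq in segments for vec in seq}
    if len(dims) > 1:
        raise ValueError(f"feature dimensions differ: {sorted(dims)}")

    joined = [vec for seq in segments for vec in seq]
    segment_ids = [idx for idx, seq in enumerate(segments) for _ in seq]
    return joined, segment_ids


# Toy sizes (hypothetical): 3 video steps, 5 history tokens, 2 question tokens.
video = [[0.1] * 4 for _ in range(3)]
history = [[0.2] * 4 for _ in range(5)]
question = [[0.3] * 4 for _ in range(2)]

inputs, segment_ids = concat_inputs(video, history, question)
print(len(inputs), segment_ids)
```

The joined sequence is what a Transformer-based language model would consume as one input; keeping segment ids alongside it is one common way to let the model tell the modalities apart.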

“…One of the advantages of using multimodal information is that the systems are able to consider more diverse interactions (e.g., dialog systems talking with the user about the events happening around them). For example, Audio Visual Scene-Aware Dialog (AVSD) has been proposed as the task of multi-turn question-answering based on given text, audio, and video signals (Nguyen et al 2019; Hori et al 2019b; Li et al 2021). Figure 1 shows the overview of the AVSD task.…”

Section: Introduction (mentioning)

confidence: 99%

“…Those models encode the text, audio, and video information into latent representations and generate response sentences. In the previous competition of DSTC8, Li et al (2021) presented fine-tuning of a pre-trained Transformer-based language model; it showed remarkable performance. They indicated that pre-training of text generation is beneficial for AVSD, but the quality of visual understanding remains an issue.…”

Section: Introduction (mentioning)

confidence: 99%

Audio Visual Scene-Aware Dialog Generation with Transformer-based Video Representations

Yamazaki, Orihashi, Masumura, et al. 2022

Preprint

There have been many attempts to build multimodal dialog systems that can respond to a question about given audiovisual information, and the representative task for such systems is the Audio Visual Scene-Aware Dialog (AVSD). Most conventional AVSD models adopt the Convolutional Neural Network (CNN)-based video feature extractor to understand visual information. While a CNN tends to obtain both temporally and spatially local information, global information is also crucial for boosting video understanding because AVSD requires long-term temporal visual dependency and whole visual information. In this study, we apply the Transformer-based video feature that can capture both temporally and spatially global representations more efficiently than the CNN-based feature. Our AVSD model with its Transformer-based feature attains higher objective performance scores for answer generation. In addition, our model achieves a subjective score close to that of human answers in DSTC10. We observed that the Transformer-based visual feature is beneficial for the AVSD task because our model tends to correctly answer the questions that need a temporally and spatially broad range of visual information.

“…Researchers are very keen to find various techniques to find solutions that can address the effective access of the video data. They performed various video summarization techniques such as visual frame reductions [2][3][4][5][6][7] and described the video in a textual formation. Frame reduction by applying deep learning approaches is a method to discard the video's useless or low attention frames.…”

Section: Introduction (mentioning)

confidence: 99%

Att-BiL-SL: Attention-Based Bi-LSTM and Sequential LSTM for Describing Video in the Textual Formation

et al. 2021

With the advancement of the technological field, day by day, people from around the world are having easier access to internet abled devices, and as a result, video data is growing rapidly. The increase of portable devices such as various action cameras, mobile cameras, motion cameras, etc., can also be considered for the faster growth of video data. Data from these multiple sources need more maintenance to process for various usages according to the needs. By considering these enormous amounts of video data, it cannot be navigated fully by the end-users. Throughout recent times, many research works have been done to generate descriptions from the images or visual scene recordings to address the mentioned issue. This description generation, also known as video captioning, is more complex than single image captioning. Various advanced neural networks have been used in various studies to perform video captioning. In this paper, we propose an attention-based Bi-LSTM and sequential LSTM (Att-BiL-SL) encoder-decoder model for describing the video in textual format. The model consists of two-layer attention-based bi-LSTM and one-layer sequential LSTM for video captioning. The model also extracts the universal and native temporal features from the video frames for smooth sentence generation from optical frames. This paper includes the word embedding with a soft attention mechanism and a beam search optimization algorithm to generate qualitative results. It is found that the architecture proposed in this paper performs better than various existing state of the art models.

Overview of the NLPCC 2022 Shared Task: Multi-modal Dialogue Understanding and Generation

Wang, Zhao 2022

Lecture Notes in Computer Science
