Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing
DOI: 10.18653/v1/2021.emnlp-main.326
Vision Guided Generative Pre-trained Language Models for Multimodal Abstractive Summarization

Abstract: Multimodal abstractive summarization (MAS) models that summarize videos (vision modality) and their corresponding transcripts (text modality) are able to extract the essential information from massive multimodal data on the Internet. Recently, large-scale generative pretrained language models (GPLMs) have been shown to be effective in text generation tasks. However, existing MAS models cannot leverage GPLMs' powerful generation ability. To fill this research gap, we aim to study two research questions: 1) how …
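The abstract is truncated above, so the paper's actual architecture is not shown here. As a rough illustration of the general idea of guiding a GPLM with vision, the following is a minimal sketch of fusing video-frame features into a text encoder via cross-modal attention; every module name, dimension, and design choice below is an assumption for illustration, not the paper's method.

```python
# Minimal sketch (NOT the paper's architecture): inject video-frame
# features into a pretrained text model's hidden states via cross-modal
# attention. All dimensions and names here are illustrative assumptions.
import torch
import torch.nn as nn

class VisionGuidedLayer(nn.Module):
    """Attends from text hidden states to video-frame features and fuses
    the result back into the text stream with a residual connection."""
    def __init__(self, d_text=768, d_vision=2048, n_heads=8):
        super().__init__()
        self.vision_proj = nn.Linear(d_vision, d_text)  # map frames into text space
        self.cross_attn = nn.MultiheadAttention(d_text, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_text)

    def forward(self, text_hidden, frame_feats):
        # text_hidden: (batch, seq_len, d_text), e.g. GPLM encoder states
        # frame_feats: (batch, n_frames, d_vision), e.g. pooled CNN features per frame
        v = self.vision_proj(frame_feats)
        attended, _ = self.cross_attn(query=text_hidden, key=v, value=v)
        return self.norm(text_hidden + attended)  # residual fusion

# Toy usage with random tensors standing in for real encoder states and frames.
layer = VisionGuidedLayer()
fused = layer(torch.randn(2, 50, 768), torch.randn(2, 16, 2048))
print(fused.shape)  # torch.Size([2, 50, 768])
```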

Cited by 39 publications (23 citation statements) | References 39 publications
“…We leverage multi-modal learning to co-represent features from both image and text. Common multimodal feature co-representation techniques (Zeng et al, 2021) include fusion mechanisms (early, at the feature level (Su et al, 2020), or late, at the decision/scoring level (Poria et al, 2016)), tensor factorization (Zadeh et al, 2017; Mai et al, 2019), and complex attention mechanisms, which can be further classified as dot-product attention (Yu et al, 2021), multi-head attention (Cao et al, 2021; Wu et al, 2021), hierarchical attention (Pramanick et al, 2021), and attention on attention, among others. These techniques have proven to be effective in encoding multi-modal features.…”
Section: Multi-modal Feature Representation
confidence: 99%
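To make the early/late fusion distinction in the statement above concrete, here is a minimal sketch contrasting feature-level (early) and decision-level (late) fusion; the stand-in encoders, dimensions, and averaging rule are illustrative assumptions, not any cited paper's exact setup.

```python
# Minimal sketch: early (feature-level) vs. late (decision/score-level)
# fusion of image and text features. Encoders and shapes are placeholders.
import torch
import torch.nn as nn

d_img, d_txt, n_classes = 512, 768, 2

# Stand-in unimodal encoders (in practice: a CNN/ViT and a text encoder).
img_enc = nn.Linear(2048, d_img)
txt_enc = nn.Linear(300, d_txt)

# Early fusion: concatenate modality features, then classify jointly.
early_head = nn.Linear(d_img + d_txt, n_classes)

# Late fusion: score each modality separately, then average the scores.
img_head = nn.Linear(d_img, n_classes)
txt_head = nn.Linear(d_txt, n_classes)

img_raw = torch.randn(4, 2048)  # e.g. pooled image features
txt_raw = torch.randn(4, 300)   # e.g. averaged word embeddings

h_img, h_txt = img_enc(img_raw), txt_enc(txt_raw)
early_logits = early_head(torch.cat([h_img, h_txt], dim=-1))
late_logits = (img_head(h_img) + txt_head(h_txt)) / 2
print(early_logits.shape, late_logits.shape)  # both torch.Size([4, 2])
```

The attention-based variants the statement lists (dot-product, multi-head, hierarchical, attention-on-attention) replace the simple concatenation or averaging above with learned cross-modal weighting, as in the first sketch in this document.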
“…Besides these pre-trained models, various models using reinforcement learning [12] [13], topic models [14] [15], multimodal information [16], attention head masking [17], information theory [18], extraction-and-paraphrasing [19], entity aggregation [20], factual consistency [21] [22] [23] [24] [25] [26], deep communicating agents [27], sentence correspondence [28], graphs [29] [30] [31], and a bottom-up approach [32] have been proposed for abstractive summarization. Because these models are not directly related to our proposed model, we do not compare with them in this paper.…”
Section: Related Work
confidence: 99%
“…Additionally, as presented in Table 5, we also run VLKD on the abstractive summarization task to evaluate its NLG performance, since BART-based methods excel at summarization (Lewis et al, 2020; Dou et al, 2021; Yu et al, 2021b). The gap between VLKD and its backbone BART is negligible.…”
Section: Evaluation of NLU and NLG
confidence: 99%
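For readers unfamiliar with the BART summarization baseline this statement refers to, here is a minimal sketch of running an off-the-shelf BART summarizer with the Hugging Face transformers library; it assumes network access to download the public "facebook/bart-large-cnn" checkpoint and is not the VLKD evaluation setup itself.

```python
# Minimal sketch: abstractive summarization with a pretrained BART model
# via Hugging Face transformers (assumes the library is installed and the
# public checkpoint can be downloaded).
from transformers import BartForConditionalGeneration, BartTokenizer

name = "facebook/bart-large-cnn"
tokenizer = BartTokenizer.from_pretrained(name)
model = BartForConditionalGeneration.from_pretrained(name)

article = (
    "Multimodal abstractive summarization models condense videos and "
    "their transcripts into short textual summaries."
)
inputs = tokenizer(article, return_tensors="pt", truncation=True, max_length=1024)
summary_ids = model.generate(inputs["input_ids"], num_beams=4, max_length=60)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))
```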