Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics 2020
DOI: 10.18653/v1/2020.acl-main.214

Integrating Multimodal Information in Large Pretrained Transformers

Abstract: Recent Transformer-based contextual word representations, including BERT and XLNet, have shown state-of-the-art performance in multiple disciplines within NLP. Fine-tuning the trained contextual models on task-specific datasets has been the key to achieving superior performance downstream. While fine-tuning these pre-trained models is straightforward for lexical applications (applications with only language modality), it is not trivial for multimodal language (a growing area in NLP focused on modeling face-to-face communication). …
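The mechanism the paper proposes for this integration is a Multimodal Adaptation Gate (MAG): visual and acoustic features produce a gated displacement vector that is added to each word-level embedding inside the pre-trained transformer. Below is a minimal PyTorch sketch of that idea; the feature dimensions, the `beta` scaling hyperparameter, and all layer names are illustrative assumptions, not the authors' exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultimodalAdaptationGate(nn.Module):
    """Sketch of a MAG-style fusion layer (after Rahman et al., 2020).

    Dimensions and beta are assumed for illustration; the paper injects
    this gate inside the encoder stack of BERT/XLNet.
    """

    def __init__(self, text_dim=768, visual_dim=47, acoustic_dim=74,
                 beta=1.0, dropout=0.1):
        super().__init__()
        self.gate_v = nn.Linear(text_dim + visual_dim, text_dim)
        self.gate_a = nn.Linear(text_dim + acoustic_dim, text_dim)
        self.proj_v = nn.Linear(visual_dim, text_dim)
        self.proj_a = nn.Linear(acoustic_dim, text_dim)
        self.beta = beta
        self.norm = nn.LayerNorm(text_dim)
        self.dropout = nn.Dropout(dropout)

    def forward(self, z, v, a):
        # z: (batch, seq_len, text_dim) word embeddings;
        # v, a: token-aligned visual / acoustic features.
        # Gates conditioned jointly on the word vector and each stream.
        g_v = F.relu(self.gate_v(torch.cat([z, v], dim=-1)))
        g_a = F.relu(self.gate_a(torch.cat([z, a], dim=-1)))
        # Nonverbal displacement vector H.
        h = g_v * self.proj_v(v) + g_a * self.proj_a(a)
        # Bound the shift so it cannot dominate the lexical embedding.
        eps = 1e-6
        alpha = torch.clamp(
            self.beta * z.norm(dim=-1, keepdim=True)
            / (h.norm(dim=-1, keepdim=True) + eps),
            max=1.0,
        )
        return self.dropout(self.norm(z + alpha * h))
```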


Cited by 329 publications (189 citation statements)
References 18 publications
“…The word embeddings are set as trainable and initialized with glove.840B.300d. Models with pre-trained contextualized word embeddings [14] are therefore excluded, to ensure identical usage of external corpora.…”
Section: Methods
confidence: 99%
“…Building on pre-trained language models, an approach to integrating visual and acoustic features into the pre-trained word-level textual features has recently been proposed [14] to suit the multimodal context. Since pre-trained language models capture word semantics well by training on a large corpus, their multimodal adaptations (MAG-XLNet, MAG-BERT) beat all existing models on multimodal sentiment analysis.…”
Section: Multimodal Sentiment Analysis
confidence: 99%
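To make this citation concrete, here is one hedged way a MAG-BERT-style model could be wired up with HuggingFace Transformers, reusing the MultimodalAdaptationGate sketch above. Applying the gate once to the final hidden states is a simplification (the paper applies it inside the encoder), and the pooling choice is an assumption.

```python
from transformers import BertModel

# Hypothetical wiring, simplified: the gate acts on BERT's final
# word-level hidden states rather than inside the encoder stack.
bert = BertModel.from_pretrained("bert-base-uncased")
mag = MultimodalAdaptationGate()  # sketch defined above

def encode(input_ids, attention_mask, visual, acoustic):
    # visual: (batch, seq_len, 47), acoustic: (batch, seq_len, 74),
    # both assumed pre-aligned to the token sequence.
    hidden = bert(input_ids, attention_mask=attention_mask).last_hidden_state
    fused = mag(hidden, visual, acoustic)  # (batch, seq_len, 768)
    return fused[:, 0]  # [CLS] position as an utterance representation
```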
“…Inspired by the work of [15] on multimodal sentiment analysis, we use an attention gate to fuse the high-level feature from the NLU encoder with the low-level feature of the ASR decoder. Normally, for the ASR task, the latent-space representation of individual tokens is conditioned directly on the input speech features.…”
Section: Top-down Attention Gate
confidence: 99%
“…More specifically, on top of the traditional transformer sequence-to-sequence (s2s) model [12] used in recent E2E SLU models [13, 14], we introduce an additional transformer encoder for the NLU task, which allows the model to exploit the entire sequence for semantic understanding through attention. In addition, following the multimodal literature, we combine the top-level NLU feature with the low-level acoustic features using an attention gate [15], which generates a shift in the low-level representation, adapting it to the high-level information and thus improving performance.…”
Section: Introduction
confidence: 99%
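The fusion these two excerpts describe, a high-level NLU feature gating a shift that is added to low-level ASR decoder states, can be sketched in the same style as the MAG layer above. The module below is a hedged adaptation of that idea to the SLU setting; the dimensions, the single pooled NLU vector, and all names are assumptions, not the citing paper's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopDownAttentionGate(nn.Module):
    """Sketch of a top-down attention gate for E2E SLU (names assumed).

    A pooled high-level NLU feature gates a shift added to low-level
    ASR decoder states, adapting the MAG-style fusion of [15].
    """

    def __init__(self, low_dim=512, high_dim=512, beta=1.0):
        super().__init__()
        self.gate = nn.Linear(low_dim + high_dim, low_dim)
        self.proj = nn.Linear(high_dim, low_dim)
        self.beta = beta
        self.norm = nn.LayerNorm(low_dim)

    def forward(self, low, high):
        # low:  (batch, steps, low_dim) ASR decoder states
        # high: (batch, high_dim) pooled NLU encoder feature
        high = high.unsqueeze(1).expand(-1, low.size(1), -1)
        g = F.relu(self.gate(torch.cat([low, high], dim=-1)))
        shift = g * self.proj(high)
        # Bound the shift relative to the original representation.
        eps = 1e-6
        alpha = torch.clamp(
            self.beta * low.norm(dim=-1, keepdim=True)
            / (shift.norm(dim=-1, keepdim=True) + eps),
            max=1.0,
        )
        return self.norm(low + alpha * shift)
```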