2021
DOI: 10.48550/arxiv.2109.05522
Preprint

TEASEL: A Transformer-Based Speech-Prefixed Language Model

Abstract: Multimodal language analysis is a burgeoning field of NLP that aims to simultaneously model a speaker's words, acoustical annotations, and facial expressions. In this area, lexicon features usually outperform other modalities because they are pre-trained on large corpora via Transformer-based models. Despite their strong performance, training a new self-supervised learning (SSL) Transformer on any modality is not usually attainable due to insufficient data, which is the case in multimodal language learning. Th…

Cited by 3 publications (5 citation statements)
References 31 publications
“…Subsequently, the Transformer was introduced into various domains, including multimodal sentiment analysis, where it spawned a series of significantly innovative approaches. The Transformer can model temporal information in the data and process unimodal data through its self-attention mechanism, and it can also realize interactions between different modalities [29–32, 48–52]. Furthermore, the Transformer exhibits strong generalization capabilities, making it suitable for different types of multimodal sentiment analysis tasks.…”
Section: Methods (Type / Description / Advantages / Flaws), mentioning
confidence: 99%
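To make the mechanism described in this statement concrete, here is a minimal PyTorch sketch of a cross-modal attention block in which text queries attend over an audio sequence, in the spirit of the cross-modal Transformers cited above. The feature dimensions, the audio projection layer, and the residual layout are illustrative assumptions, not details taken from the cited papers.

```python
import torch
import torch.nn as nn

class CrossModalAttention(nn.Module):
    """Text queries attend over the audio sequence (illustrative block)."""
    def __init__(self, d_model=768, d_audio=74, n_heads=8):
        super().__init__()
        self.audio_proj = nn.Linear(d_audio, d_model)  # align audio dim to text dim
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, text, audio):
        # text: (batch, T_text, d_model), audio: (batch, T_audio, d_audio)
        a = self.audio_proj(audio)
        # queries come from the text modality; keys/values from the audio modality
        fused, _ = self.attn(query=text, key=a, value=a)
        return self.norm(text + fused)  # residual connection

# toy usage with random tensors (dimensions are assumptions)
text = torch.randn(2, 50, 768)   # 50 word-level features
audio = torch.randn(2, 200, 74)  # 200 frame-level acoustic features
out = CrossModalAttention()(text, audio)  # (2, 50, 768)
```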
“…With the invention of the Transformer [28] and its outstanding performance in natural language processing, the Transformer has been widely used in other research areas such as multimodal sentiment analysis. For example, [29–32] leverage the Transformer encoder to model correlation information between different modalities and have achieved good results in multimodal sentiment analysis. Some scholars have used tensor-based fusion methods [33–35] to solve the problem of fusing multimodal features, and other researchers have adopted approaches such as self-supervised learning [36], contrastive learning [37], and multi-task learning [38].…”
Section: Introduction, mentioning
confidence: 99%
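The tensor-based fusion this statement mentions [33–35] is compact enough to sketch directly. The snippet below shows the core TFN-style operation, padding each per-modality utterance vector with a constant 1 and taking a three-way outer product so that unimodal, bimodal, and trimodal interaction terms all appear; the dimensions are illustrative assumptions.

```python
import torch

def tensor_fusion(text_vec, audio_vec, video_vec):
    """TFN-style fusion: outer product of per-modality vectors,
    each padded with a 1 so lower-order interaction terms survive."""
    B = text_vec.size(0)
    one = torch.ones(B, 1)
    t = torch.cat([text_vec, one], dim=1)   # (B, d_t + 1)
    a = torch.cat([audio_vec, one], dim=1)  # (B, d_a + 1)
    v = torch.cat([video_vec, one], dim=1)  # (B, d_v + 1)
    # batched 3-way outer product, flattened into one fusion feature
    fused = torch.einsum('bi,bj,bk->bijk', t, a, v)
    return fused.flatten(start_dim=1)       # (B, (d_t+1)(d_a+1)(d_v+1))

# toy usage: 32-d text, 16-d audio, 16-d video utterance embeddings
f = tensor_fusion(torch.randn(4, 32), torch.randn(4, 16), torch.randn(4, 16))
```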
“…Chen et al. [39] and Poria et al. [11] used LSTM-based models together with attention units to capture dynamics across modalities. In [30–34], multi-head and self-attention were used to capture relevant information within or across modalities. In addition, researchers have used other methods, e.g., the Gated Recurrent Unit (GRU) [35, 36] and the Graph Convolutional Network (GCN) [37].…”
Section: Attention-based, mentioning
confidence: 99%
“…Poria et al. [11] used attention units to capture dynamics across modalities. In [30–34], multi-head and self-attention were used to perform cross-modal interactions and to perceive emotional information that is not available within a single modality. In addition, researchers have used other attention-based methods such as Gated Recurrent Units (GRUs) [35, 36] and Graph Convolutional Networks (GCNs) [37].…”
Section: Introduction, mentioning
confidence: 99%
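As a complement to the attention-based blocks above, here is a minimal sketch of the GRU-based alternative these statements mention [35, 36]: a bidirectional GRU summarizing a frame-level acoustic sequence into a single utterance vector. The input and hidden sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn

class GRUEncoder(nn.Module):
    """Bidirectional GRU that pools a frame sequence into one utterance vector."""
    def __init__(self, d_in=74, d_hidden=64):
        super().__init__()
        self.gru = nn.GRU(d_in, d_hidden, batch_first=True, bidirectional=True)

    def forward(self, x):                        # x: (batch, T, d_in)
        _, h = self.gru(x)                       # h: (2, batch, d_hidden)
        return torch.cat([h[0], h[1]], dim=-1)   # (batch, 2 * d_hidden)

utt = GRUEncoder()(torch.randn(2, 200, 74))      # (2, 128)
```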
“…Some researchers also use multiple self-attention blocks to combine different modalities in pairs through the self-attention mechanism [15]. The RoBERTa model [16] is trained on audio data as a dynamic representation alongside the text features, achieving very good results.…”
Section: Multi-modal Sentiment Analysis Model, mentioning
confidence: 99%
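Since this last statement refers back to TEASEL's own speech-prefix idea, the following is a hedged PyTorch sketch of that pattern using the Hugging Face RobertaModel: audio features are projected into the language model's embedding space and prepended to the token embeddings via inputs_embeds. The projection layer, prefix pooling, and dimensions are illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn
from transformers import RobertaModel

class SpeechPrefixedLM(nn.Module):
    """Sketch: prepend a speech-derived prefix to RoBERTa's token embeddings."""
    def __init__(self, d_audio=74, prefix_len=8):
        super().__init__()
        self.lm = RobertaModel.from_pretrained("roberta-base")
        d_model = self.lm.config.hidden_size           # 768 for roberta-base
        self.audio_proj = nn.Linear(d_audio, d_model)  # hypothetical projection
        self.pool = nn.AdaptiveAvgPool1d(prefix_len)   # fixed-length prefix

    def forward(self, input_ids, attention_mask, audio):
        # audio: (batch, T_audio, d_audio) -> prefix: (batch, prefix_len, d_model)
        prefix = self.audio_proj(audio).transpose(1, 2)
        prefix = self.pool(prefix).transpose(1, 2)
        tok = self.lm.embeddings.word_embeddings(input_ids)
        embeds = torch.cat([prefix, tok], dim=1)
        # extend the attention mask to cover the speech prefix tokens
        mask = torch.cat(
            [torch.ones(prefix.shape[:2],
                        dtype=attention_mask.dtype,
                        device=attention_mask.device),
             attention_mask], dim=1)
        return self.lm(inputs_embeds=embeds, attention_mask=mask)
```

Passing inputs_embeds lets RoBERTa add positional embeddings over the combined prefix-plus-token sequence, so the speech prefix is consumed like ordinary leading tokens.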