2020
DOI: 10.1609/aaai.v34i05.6431
Learning Relationships between Text, Audio, and Video via Deep Canonical Correlation for Multimodal Language Analysis

Abstract: Multimodal language analysis often considers relationships between features based on text and those based on acoustical and visual properties. Text features typically outperform non-text features in sentiment analysis or emotion recognition tasks, in part because the text features are derived from advanced language models or word embeddings trained on massive data sources, while audio and video features are human-engineered and comparatively underdeveloped. Given that the text, audio, and video are describing th…
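
The core technique named in the title is deep canonical correlation analysis (DCCA), which maximizes the total correlation between learned projections of two modalities. Below is a minimal, illustrative sketch of that objective in PyTorch; the variable names, regularization value, and two-view setup are assumptions for exposition, not the authors' released implementation.

```python
# Illustrative DCCA objective (sketch, not the authors' code).
# H1, H2: projections of two modalities, shape (batch, dim).
import torch

def cca_correlation(H1, H2, reg=1e-4):
    """Total canonical correlation between two mini-batch views.

    Returns the sum of singular values of T = S11^{-1/2} S12 S22^{-1/2};
    a DCCA model maximizes this quantity (i.e., minimizes its negative).
    """
    n = H1.size(0)
    H1 = H1 - H1.mean(dim=0, keepdim=True)  # center each view
    H2 = H2 - H2.mean(dim=0, keepdim=True)

    S12 = H1.t() @ H2 / (n - 1)
    S11 = H1.t() @ H1 / (n - 1) + reg * torch.eye(H1.size(1), dtype=H1.dtype)
    S22 = H2.t() @ H2 / (n - 1) + reg * torch.eye(H2.size(1), dtype=H2.dtype)

    def inv_sqrt(S):
        # Symmetric inverse square root via eigendecomposition.
        vals, vecs = torch.linalg.eigh(S)
        return vecs @ torch.diag(vals.clamp(min=1e-12).rsqrt()) @ vecs.t()

    T = inv_sqrt(S11) @ S12 @ inv_sqrt(S22)
    return torch.linalg.svdvals(T).sum()

# Typical use: loss = -cca_correlation(text_proj, audio_proj); loss.backward()
```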

Cited by 249 publications (90 citation statements)
References 18 publications
“…In our research, we use three publicly available pre-trained SSL models to extract features. Previous work [13], [14] has used SSL features extracted from BERT [20] to represent the text modality in multimodal emotion recognition. To the best of our knowledge, this is the first time two or more pre-trained SSL models are used to extract features in multimodal emotion recognition.…”
Section: B. Multimodal Features Extracted from Pre-trained SSL Algorithms
confidence: 99%
“…Such work mainly focuses on extracting features related to facial expressions [11] and speech signals [12] from already trained DL networks based on supervised learning methods. Most of the prior work uses both low-level features and deep features (features extracted from pre-trained DL models) [13], [14], rather than representing all modalities with deep features.…”
Section: Introduction
confidence: 99%
“…Note that this variation makes this formulation similar to the hierarchical attention formulation of Yang et al. [37], except in their case there was no cross-modal influence. Correlation Network: There are existing papers that model the correlations between audio and video [2,9,14,15,31]; our paper uses cross-entropy loss to model the correlation between the audio/video channels and uses hard negative mining to train this sub-network effectively. Another aspect of our work is using DNNs to transform the output of one modality before it is used by another modality; a similar idea of shifting attention in videos has been used by Long et al. [21], but they did not use non-linear transforms.…”
Section: Related Work
confidence: 99%
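
A hedged sketch of the correlation idea described in the passage above (a citing paper's method, not the DCCA paper itself): matched audio/video pairs in a batch are scored against mismatched ones with a binary cross-entropy loss, keeping only the hardest negatives. The similarity function, mining ratio, and tensor names here are assumptions for illustration.

```python
# Sketch only: cross-entropy correlation between audio and video embeddings
# with hard negative mining, as loosely described in the quoted passage.
import torch
import torch.nn.functional as F

def av_correlation_loss(audio_emb, video_emb, neg_per_anchor=1):
    B = audio_emb.size(0)
    scores = audio_emb @ video_emb.t()                  # (B, B) similarity logits
    match = torch.eye(B, dtype=torch.bool, device=scores.device)

    # Positive term: matched audio/video pairs should score high.
    pos_loss = F.binary_cross_entropy_with_logits(
        scores.diagonal(), torch.ones(B, device=scores.device))

    # Hard negative mining: highest-scoring mismatched pairs per anchor.
    neg_scores = scores.masked_fill(match, float('-inf'))
    hard_negs, _ = neg_scores.topk(neg_per_anchor, dim=1)
    neg_loss = F.binary_cross_entropy_with_logits(
        hard_negs, torch.zeros_like(hard_negs))

    return pos_loss + neg_loss
```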
“…In order to use both tweets and videos, we need to consider multiple different modalities. Several methods focusing on the relationships across multiple different modalities have been proposed [9,10,11]. The method in [9] realizes efficient cross-modal video-text retrieval using multi-modal features such as visual characteristics, audio, and text.…”
Section: Introduction
confidence: 99%
“…The method in [9] realizes efficient cross-modal video-text retrieval using multi-modal features such as visual characteristics, audio, and text. Then, the method in [10] learns multimodal embeddings between modalities such as text, video, and audio via deep canonical correlation analysis. The method in [11] uses a multimodal variational autoencoder (MVAE) [12] that includes a fake news detection network based on tweets and visual information, and its performance has been improved.…”
Section: Introduction
confidence: 99%