2020
DOI: 10.48550/arxiv.2010.11985
Preprint

MTAG: Modal-Temporal Attention Graph for Unaligned Human Multimodal Language Sequences

Abstract: Human communication is multimodal in nature; it is through multiple modalities, i.e., language, voice, and facial expressions, that opinions and emotions are expressed. Data in this domain exhibits complex multi-relational and temporal interactions. Learning from this data is a fundamentally challenging research problem. In this paper, we propose Multimodal Temporal Graph Attention Networks (MTGAT). MTGAT is an interpretable graph-based neural model that provides a suitable framework for analyzing this type of…
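To make the abstract's idea concrete, the sketch below (plain PyTorch, not the authors' released code) builds one graph over unaligned sequences, with one node per timestep of each modality, and runs a single attention pass over its edges. The layer, the fully connected edge set, and all dimensions are illustrative assumptions; MTAG/MTGAT additionally types edges by modality pair and temporal direction.

```python
# A minimal, illustrative sketch (not the authors' implementation) of the core idea:
# unaligned multimodal sequences become one graph whose nodes are per-timestep
# features of each modality, and attention over the edges fuses information
# across time and across modalities. Dimensions and edge construction are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ModalTemporalAttention(nn.Module):
    """One attention pass over a modal-temporal graph."""
    def __init__(self, dim):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)

    def forward(self, x, edge_index):
        # x: (num_nodes, dim) node features; edge_index: (2, num_edges) src->dst pairs
        src, dst = edge_index
        scores = (self.q(x)[dst] * self.k(x)[src]).sum(-1) / x.size(-1) ** 0.5
        # Softmax over the incoming edges of each destination node.
        alpha = torch.zeros_like(scores)
        for node in dst.unique():
            mask = dst == node
            alpha[mask] = F.softmax(scores[mask], dim=0)
        # Aggregate attention-weighted messages into the destination nodes.
        out = torch.zeros_like(x)
        out.index_add_(0, dst, alpha.unsqueeze(-1) * self.v(x)[src])
        return out

# Toy usage: 3 text steps, 4 audio steps, 2 vision steps, all projected to dim=8.
text, audio, vision = torch.randn(3, 8), torch.randn(4, 8), torch.randn(2, 8)
x = torch.cat([text, audio, vision], dim=0)  # 9 nodes total
# Fully connect all nodes for illustration; a modal-temporal graph would instead
# type and restrict edges by modality pair and temporal order.
idx = torch.arange(x.size(0))
edge_index = torch.stack(torch.meshgrid(idx, idx, indexing="ij")).reshape(2, -1)
fused = ModalTemporalAttention(dim=8)(x, edge_index)
print(fused.shape)  # torch.Size([9, 8])
```

Restricting and typing the edge set, rather than fully connecting the nodes as above, is what would make such a graph modality- and temporal-aware.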

Cited by 5 publications (11 citation statements)
References 31 publications
“…First, certain multimodal language approaches freeze a Transformer and aim to fuse all modalities using tensor outer-products [23,24], Canonical Correlation Analysis-based methods [25], attentive LSTM-based methods [26,27,8,28,19], sequence-to-sequence-based methods [29], cross-modal Transformer-based methods [6,30,31], graph-based methods [32], and multi-task learning [7].…”
Section: Human Multimodal Language Analysis
confidence: 99%
“…With RNNs as their main modules, they are confronted with training difficulties and long inference times. Recently, [16,24,30] propose alternative networks to model unaligned multimodal sequences. Tsai et al. [24] use a cross-modal transformer and a self-attention transformer to learn long-range dependencies.…”
Section: Related Work 2.1 Human Multimodal Language Analysis
confidence: 99%
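For readers unfamiliar with the cross-modal transformer mentioned in the statement above ([24]), the following hedged sketch shows the basic operation: one modality supplies the queries and another supplies the keys and values, so the two sequences need not be word-aligned. Sequence lengths, dimensions, and the use of nn.MultiheadAttention are assumptions for illustration, not the cited architecture.

```python
# Illustrative only: a single cross-modal attention step in the spirit of the
# cross-modal transformer referenced above ([24]); layer sizes and sequence
# lengths are arbitrary assumptions.
import torch
import torch.nn as nn

dim = 32
attn = nn.MultiheadAttention(embed_dim=dim, num_heads=4, batch_first=True)

text = torch.randn(1, 12, dim)   # 12 text tokens (queries)
audio = torch.randn(1, 50, dim)  # 50 audio frames (keys/values), unaligned with the text

# Each text token attends over the whole audio stream, so no word-level
# alignment between the two sequences is required.
text_enriched, _ = attn(query=text, key=audio, value=audio)
print(text_enriched.shape)  # torch.Size([1, 12, 32])
```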
“…In contrast, our proposed GraphCAGE replaces the self-attention transformer with a graph-based model that produces more refined, higher-level representations of sequences. In [16] and [30], sequences are transformed into graphs and GCNs are applied to learn long-range dependencies, which not only avoids the problems of RNNs but also successfully models unaligned multimodal sequences. Nevertheless, they implement graph pooling and edge pruning to drop some nodes in order to obtain the final graph representation, leading to information loss.…”
Section: Related Work 2.1 Human Multimodal Language Analysis
confidence: 99%
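The edge pruning criticized in the statement above can be illustrated with a small, assumed top-k rule: for each destination node, keep only its k highest-scoring incoming edges and discard the rest (the discarded edges are exactly where information can be lost). The scoring function and the value of k below are hypothetical, not the cited method.

```python
# A hedged sketch of the kind of edge pruning discussed above: keep only the
# top-k highest-scoring incoming edges per destination node and drop the rest.
import torch

def prune_edges(edge_index, scores, k=2):
    """Keep at most k incoming edges per destination node."""
    dst = edge_index[1]
    keep = torch.zeros(scores.numel(), dtype=torch.bool)
    for node in dst.unique():
        idx = (dst == node).nonzero(as_tuple=True)[0]
        top = idx[scores[idx].topk(min(k, idx.numel())).indices]
        keep[top] = True
    return edge_index[:, keep]

# Toy graph: 4 nodes, 8 directed edges with random attention scores.
edge_index = torch.tensor([[0, 1, 2, 3, 0, 2, 1, 3],
                           [1, 1, 1, 1, 2, 2, 3, 3]])
scores = torch.rand(edge_index.size(1))
pruned = prune_edges(edge_index, scores, k=2)
print(pruned.size(1))  # at most 2 incoming edges remain per destination node
```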