2020
DOI: 10.1109/jstsp.2020.2987728

Multimodal Intelligence: Representation Learning, Information Fusion, and Applications

Abstract: Deep learning has revolutionized speech recognition, image recognition, and natural language processing since 2010, each involving a single modality in the input signal. However, many applications in artificial intelligence involve more than one modality. It is therefore of broad interest to study the more difficult and complex problem of modeling and learning across multiple modalities. In this paper, a technical review of the models and learning methods for multimodal intelligence is provided. The main focus…


Cited by 284 publications (97 citation statements)
References 199 publications (203 reference statements)
“…These diverse modalities differ in their scales, representation format, varied predictive power, weights, and contributions towards the final task [9]. Optimal data fusion schemes such as early [11], late [48], and hybrid fusion [49] schemes are developed to fuse the modalities at data, feature, decision, and intermediate mixed levels [50]. Deep neural nets [51], kernel-based methods [52], and graphical models [47,48] are employed for analysis and handling such data depending on the downstream task [46].…”
Section: Multimodal Machine Learning
confidence: 99%
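The early and late fusion schemes mentioned in the statement above can be illustrated with a minimal sketch. This is not code from the reviewed paper; the toy features, probabilities, and the equal weighting are assumptions chosen purely for illustration: early fusion concatenates per-modality features before modeling, while late fusion combines per-modality decisions afterwards.

```python
import numpy as np

def early_fusion(audio_feat, text_feat):
    """Early (feature-level) fusion: concatenate per-modality feature
    vectors into one joint representation before any model sees them."""
    return np.concatenate([audio_feat, text_feat])

def late_fusion(audio_probs, text_probs, w_audio=0.5):
    """Late (decision-level) fusion: combine per-modality class
    probabilities, here by a simple weighted average."""
    return w_audio * audio_probs + (1.0 - w_audio) * text_probs

# Toy two-modality example (hypothetical values).
audio_feat = np.array([0.2, 0.8])
text_feat = np.array([0.5, 0.1, 0.4])
joint = early_fusion(audio_feat, text_feat)   # one 5-dim joint vector

audio_probs = np.array([0.7, 0.3])
text_probs = np.array([0.4, 0.6])
fused = late_fusion(audio_probs, text_probs)  # averaged class scores
```

Hybrid fusion, also cited above, would mix both: fuse some modalities at the feature level and merge the rest at the decision level.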
“…Multimodal Machine Learning — There is a long history of research in this area, exploring different directions [30,31,32]. Representation learning [33,34,35] is one such direction, in which effective and robust joint features are learned, typically from large-scale data sets, to be used in general downstream tasks such as visual question answering or visual commonsense reasoning.…”
Section: Related Work
confidence: 99%
“…Despite the growing body of work on LSM, most of these methods have neglected the importance of utilizing the representations learned by a well-trained model, which causes them to fail to transfer the learned knowledge to other datasets. Learning good representations from the input data is a core problem for unsupervised learning [12]. Although some recent studies [13] and [14] have shown good performance in extracting representations with impressive properties, some key problems remain to be addressed.…”
Section: A Objectives
confidence: 99%