Uncertainty-Aware Multi-modal Learning via Cross-Modal Random Network Prediction

Wang, Hu; Zhang, Jianpeng; Chen, Yuanhong; Ma, Chao; Avery, Jodie; Hull, Louise; Carneiro, Gustavo

doi:10.1007/978-3-031-19836-6_12

Cited by 10 publications

(4 citation statements)

References 22 publications


“…It is labelled with the suffix "partial" because it can make predictions in the case of incomplete observations as well. We applied a simple method used in multimodal modelling [28] to replace missing modality features. This method substitutes the features of a missing modality with those of other modalities.…”

Section: Model Architecture (mentioning)

confidence: 99%

“…Additionally, we experimented with multimodal methods to address the problem of the entire model being unfeasible in the event of the absence of any one modality. These methods have been studied in various fields [25], [26], [27], [28]. However, in this study, considering the case when MTSO data were found to be unavailable, we introduced a simple method to replace the missing MTSO data with feature data from CCTV images; subsequently, we tested the effectiveness of this method.…”

Section: Introduction (mentioning)

confidence: 99%


FogFusionNet: Coastal Sea Fog Prediction by Using a Multimodal Deep Learning Approach

Son, Chun, Kim et al. 2024

IEEE Access


In this study, we designed FogFusionNet, a multimodal sea fog prediction model that uses closed-circuit television (CCTV) images and multivariate time series observation (MTSO) data to predict three visibility classes (Normal visibility, Low visibility, and Sea fog) at 1-h intervals from the current time to 6 h in the future for a specific region. We applied weighted sampling and weighted loss to overcome the imbalance among the visibility classes, and additionally evaluated the effect of replacing missing MTSO data. A total of 4 years of data for Incheon Port, which faces the Yellow Sea and is prone to sea fog, were collected for training and verifying FogFusionNet. Of these, 3 years of data were used for training and the remaining 1 year for verifying performance. The prediction performance of FogFusionNet at 1-h intervals was 86.2% (0-h), 79.1% (1-h), 73.4% (2-h), 70.7% (3-h), 64.7% (4-h), 59.6% (5-h), and 49.3% (6-h), an average of 69.0%. FogFusionNet is expected to promote coastal safety and reduce economic losses due to coastal sea fog. INDEX TERMS: Closed-circuit television (CCTV) images, Coastal sea fog prediction, Multimodal learning, Multivariate time series observation (MTSO) data, Visibility class.


“…Similarly, Tian et al. [49] explored the use of uncertainty estimation in fusing the softmax scores predicted using CNNs for semantic segmentation. Other notable approaches to uncertainty-aware multimodal fusion are based on optimal transport for cross-modal correspondence [50], random prior functions [51], boosted ensembles [52], and factorised deep Markov models [53].…”

Section: Related Work (mentioning)

confidence: 99%

COLD Fusion: Calibrated and Ordinal Latent Distribution Fusion for Uncertainty-Aware Multimodal Emotion Recognition

Tellamekala, Amiriparian, Schuller et al. 2024

IEEE Trans. Pattern Anal. Mach. Intell.


Automatically recognising apparent emotions from face and voice is hard, in part because of various sources of uncertainty, including in the input data and the labels used in a machine learning framework. This paper introduces an uncertainty-aware multimodal fusion approach that quantifies modality-wise aleatoric or data uncertainty towards emotion prediction. We propose a novel fusion framework, in which latent distributions over unimodal temporal context are learned by constraining their variance. These variance constraints, Calibration and Ordinal Ranking, are designed such that the variance estimated for a modality can represent how informative the temporal context of that modality is w.r.t. emotion recognition. When well-calibrated, modality-wise uncertainty scores indicate how much their corresponding predictions are likely to differ from the ground truth labels. Well-ranked uncertainty scores allow the ordinal ranking of different frames across different modalities. To jointly impose both these constraints, we propose a softmax distributional matching loss. Our evaluation on AVEC 2019 CES, CMU-MOSEI, and IEMOCAP datasets shows that the proposed multimodal fusion method not only improves the generalisation performance of emotion recognition models and their predictive uncertainty estimates, but also makes the models robust to novel noise patterns encountered at test time.


“…However, the combination of multimodal data is usually challenging. There are studies on fusing multimodal data according to their uncertainties, but this may face numerical instability and is difficult to transfer from one application to another [19]. Instead of directly fusing the multisensory data in a numerical space, we propose to use multimodal modules to translate them into natural language expressions that an LLM can easily digest.…”

Section: Related Work (mentioning)

confidence: 99%

Chat with the Environment: Interactive Multimodal Perception using Large Language Models

Zhao, Li, Weber et al. 2023

Preprint


Programming robot behaviour in a complex world faces challenges on multiple levels, from dextrous low-level skills to high-level planning and reasoning. Recent pre-trained Large Language Models (LLMs) have shown remarkable reasoning ability in zero-shot robotic planning. However, it remains challenging to ground LLMs in multimodal sensory input and continuous action output, while enabling a robot to interact with its environment and acquire novel information as its policies unfold. We develop a robot interaction scenario with a partially observable state, which necessitates a robot to decide on a range of epistemic actions in order to sample sensory information among multiple modalities, before being able to execute the task correctly. An interactive perception framework is therefore proposed with an LLM as its backbone, whose ability is exploited to instruct epistemic actions and to reason over the resulting multimodal sensations (vision, sound, haptics, proprioception), as well as to plan an entire task execution based on the interactively acquired information. Our study demonstrates that LLMs can provide high-level planning and reasoning skills and control interactive robot behaviour in a multimodal environment, while multimodal modules with the context of the environmental state help ground the LLMs and extend their processing ability.

