Findings of the Association for Computational Linguistics: EMNLP 2021
DOI: 10.18653/v1/2021.findings-emnlp.196

MIRTT: Learning Multimodal Interaction Representations from Trilinear Transformers for Visual Question Answering

Cited by 5 publications (2 citation statements)
References 30 publications
“…Cao et al [162] provided the blueprint for the decision tree and proposed a parse-tree-guided reasoning network for interpretable VQA. Wang et al [163] fashioned a model that learns multimodal interaction representations from trilinear transformers (MIRTT) for VQA tasks. In the domain of video QA, Peng et al [164] unveiled a multilevel hierarchical network (MHN) that takes into account the information spanning various temporal scales.…”
Section: Cognition (mentioning)
confidence: 99%
“…These uncertainties show considerable challenges in the effective training of AI models for these specialized applications. Contrary to addressing these issues, existing methods [5]- [7] often overlook these uncertainties, which often results in limited capabilities in comprehending complex concept hierarchies and a lack of prediction diversity. Therefore, it is imperative to model such multimodal uncertainties.…”
Section: Introduction (mentioning)
confidence: 99%