Proceedings of the Third Workshop on Multimodal Artificial Intelligence 2021
DOI: 10.18653/v1/2021.maiworkshop-1.11
Semantic Aligned Multi-modal Transformer for Vision-Language Understanding: A Preliminary Study on Visual QA

Abstract: Recent vision-language understanding approaches adopt a multi-modal transformer pretraining and finetuning paradigm. Prior work learns representations of text tokens and visual features with cross-attention mechanisms and captures the alignment solely based on indirect signals. In this work, we propose to enhance the alignment mechanism by incorporating image scene graph structures as the bridge between the two modalities, and learning with new contrastive objectives. In our preliminary study on the challenging…
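The paper's exact contrastive objectives are not shown in this truncated abstract. As a rough illustration of what "learning with contrastive objectives" over aligned modalities typically means, below is a minimal sketch of a symmetric InfoNCE-style loss between text embeddings and scene-graph embeddings; the function name, the temperature value, and the use of pooled per-example embeddings are assumptions for illustration, not the authors' implementation.

```python
import numpy as np

def info_nce_loss(text_emb, graph_emb, temperature=0.07):
    """Symmetric InfoNCE-style contrastive loss (illustrative sketch).

    text_emb, graph_emb: (N, D) arrays where row i of each matrix is a
    matched text / scene-graph pair. Matched pairs are pulled together,
    mismatched pairs within the batch are pushed apart.
    """
    # L2-normalize so the dot product is a cosine similarity
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    g = graph_emb / np.linalg.norm(graph_emb, axis=1, keepdims=True)
    logits = t @ g.T / temperature        # (N, N); positives on the diagonal
    labels = np.arange(logits.shape[0])

    def cross_entropy(l):
        # numerically stable log-softmax over each row
        l = l - l.max(axis=1, keepdims=True)
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[labels, labels].mean()

    # average the text->graph and graph->text directions
    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))
```

With perfectly aligned pairs the loss approaches zero, while shuffled (mismatched) pairs yield a large loss, which is the gradient signal that drives the two modalities toward a shared embedding space.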

Cited by 2 publications (1 citation statement)
References 13 publications
“…Depthwise separable convolution is a type of convolution that enables the model to learn local relationships. The Semantic Aligned Multi-modal Transformer (12) is utilized to enhance the alignment mechanism by incorporating image scene graph structures as a bridge between vision and language. (13) proposed MMFT-BERT.…”
Section: Transformer Based Approaches For Visual Question Answering
confidence: 99%