2022
DOI: 10.3390/electronics11111778

Multi-Modal Alignment of Visual Question Answering Based on Multi-Hop Attention Mechanism

Abstract: The alignment of information between the image and the question is of great significance in the visual question answering (VQA) task. Self-attention is commonly used to generate attention weights between the image and the question, and these weights align the two modalities: through them, the model can select the regions of the image that are relevant to the question. However, when the self-attention mechanism is used, the attention weight between two objects is determined only by the representation…
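As a rough sketch of the alignment idea the abstract describes, the snippet below lets question token features attend over image region features for several hops, so each hop can refine the attention weights using the visual context gathered by the previous one. The class name, dimensions, per-hop projections, and residual update are illustrative assumptions, not the paper's actual architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHopAlignment(nn.Module):
    """Illustrative multi-hop attention between question tokens and image regions."""
    def __init__(self, dim: int, hops: int = 2):
        super().__init__()
        self.hops = hops
        # separate projections per hop so each hop can refine the alignment
        self.q_proj = nn.ModuleList([nn.Linear(dim, dim) for _ in range(hops)])
        self.v_proj = nn.ModuleList([nn.Linear(dim, dim) for _ in range(hops)])

    def forward(self, v: torch.Tensor, q: torch.Tensor) -> torch.Tensor:
        # v: (batch, regions, dim) image region features
        # q: (batch, tokens, dim)  question token features
        ctx = q
        for i in range(self.hops):
            # attention weights between every question token and every image region
            scores = self.q_proj[i](ctx) @ self.v_proj[i](v).transpose(1, 2)
            attn = F.softmax(scores / ctx.size(-1) ** 0.5, dim=-1)
            # fold the attended visual context back into the query for the next hop
            ctx = ctx + attn @ v
        return ctx  # question features enriched with aligned visual context

# Example: 36 detected regions, a 14-token question, 512-dim features.
align = MultiHopAlignment(dim=512, hops=2)
out = align(torch.randn(8, 36, 512), torch.randn(8, 14, 512))  # -> (8, 14, 512)
```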

Cited by 6 publications (4 citation statements)
References 38 publications
“…Here, self-attention and guided attention are models that learn interactions between different parts of the image and the question, enhancing the accuracy of the answer by learning the rich interactions between the visual and language streams. The model proposed by (19) introduces a multi-hop attention alignment method that enriches surrounding information when using self-attention.…”
Section: Attention Based Approaches for Visual Question Answering (mentioning)
confidence: 99%
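For context on the guided-attention idea mentioned in the statement above, a common formulation scores each image region against a pooled question vector and takes a softmax-weighted sum of the regions. The additive-attention form and layer sizes below are generic assumptions, not the cited model's code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GuidedAttention(nn.Module):
    """A pooled question vector guides attention over image region features."""
    def __init__(self, dim: int, hidden: int = 512):
        super().__init__()
        self.w_v = nn.Linear(dim, hidden)  # project image regions
        self.w_q = nn.Linear(dim, hidden)  # project the question summary
        self.w_s = nn.Linear(hidden, 1)    # score each region

    def forward(self, v: torch.Tensor, q_vec: torch.Tensor) -> torch.Tensor:
        # v: (batch, regions, dim) image features, q_vec: (batch, dim) question summary
        joint = torch.tanh(self.w_v(v) + self.w_q(q_vec).unsqueeze(1))
        weights = F.softmax(self.w_s(joint).squeeze(-1), dim=-1)  # (batch, regions)
        return (weights.unsqueeze(-1) * v).sum(dim=1)             # attended image feature
```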
“…It enhances accuracy by utilizing a CNN that predicts bounding boxes for objects in an image. (54) CPDR (55), MulFA/UFSCAN (56), Bilinear Graph (57), AttReg (58), AMAM (16), Scene-text using PHOC (59), MGRF (60), Bottom-Up and Top-Down (61), DCAMN (39), Skill Concept (62), PGM (63), SR-OCE (64), RAMEN (65), CSST (66), Coarse-to-Fine (67), GMA (68), BLOCK (69), CapsAtt (32,40), Re-attention (70), CRN (71), CAT (11), shortcut (72), DAQC (15), MGFAN (73), MMMH (19), MSG (74), Fair-VQA (75), Attention map (5), SAVQA (76), MGAVQA (77), MuKEA (78), ACVRM (79), QD-GFN (23), Swap-Mix (80), CVA (17), HGNMN (26), SUPER (37), Uncertainty based (81), CLG (82), WSQG (83), VLR (84), LXMERT (85), SceneGATE (86)…”
Section: Visual Feature Extraction Techniques (mentioning)
confidence: 99%
“…Bahdanau used attention mechanisms to complete the task of machine translation for the first time [31]. Since then, various types of attention mechanisms have appeared, such as Co-Attention networks [32], Self-Attention networks [33] and Recurrent Attention networks [34].…”
Section: Introduction (mentioning)
confidence: 99%
“…That is, the observation can be adjusted toward the more informative features according to their relative importance, focusing the algorithm on the most relevant parts of the input and moving from global features to focused ones, thus saving resources and obtaining the most useful information quickly. The attention mechanism has arguably become one of the most important concepts in deep learning. Since Bahdanau, Cho & Bengio (2015) used an attention mechanism for machine translation, various variants have emerged, such as Co-Attention networks (Yang et al., 2019a; Han et al., 2021; Yu et al., 2019; Liu et al., 2021b; Lu et al., 2016; Sharma & Srivastava, 2022), Recurrent Attention networks (Osman & Samek, 2019; Ren & Zemel, 2017; Gan et al., 2019), and Self-Attention networks (Li et al., 2019; Fan et al., 2019; Ramachandran et al., 2019; Xia et al., 2022; Xiang et al., 2022; Yan, Silamu & Li, 2022). The effectiveness of visual information processing is considerably enhanced by all of these attention mechanisms, which also optimize VQA performance.…”
Section: Introduction (mentioning)
confidence: 99%
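To make the "relative importance" idea in the statement above concrete, here is a minimal, generic sketch (an assumed example, not taken from any of the cited works): raw relevance scores for the parts of an input are normalized with a softmax and used to weight those parts, so the most relevant ones dominate the pooled representation.

```python
import torch
import torch.nn.functional as F

def importance_pooling(features: torch.Tensor, scores: torch.Tensor) -> torch.Tensor:
    """Weight the parts of an input by their softmax-normalized relevance scores.

    features: (parts, dim) feature vectors, e.g. image regions or question words
    scores:   (parts,)     raw relevance score for each part
    """
    weights = F.softmax(scores, dim=0)  # relative importance, sums to 1
    return weights @ features           # (dim,) representation focused on relevant parts

# Three parts with scores 0.1, 2.0, 0.3: the second part dominates the output.
pooled = importance_pooling(torch.randn(3, 8), torch.tensor([0.1, 2.0, 0.3]))
```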