2017 IEEE International Conference on Multimedia and Expo (ICME)
DOI: 10.1109/icme.2017.8019540
Adaptive attention fusion network for visual question answering

Cited by 5 publications (6 citation statements) · References 11 publications

“…Many approaches to VQA have been explored in the past. For example, [4,5,27,28] used attention-based mechanisms to solve VQA problems. Instead of attending to every detail of the image and every word of the question, an attention mechanism lets the model concentrate only on the most relevant regions of the image and words of the question.…”
Section: Related Work (mentioning)
confidence: 99%
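
As a rough illustration of the soft-attention idea this statement describes, the PyTorch sketch below scores image region features against a question vector and pools them by relevance. All class names, dimensions, and the scoring form are illustrative assumptions, not the cited papers' implementations.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SoftAttention(nn.Module):
    """Question-guided soft attention over image region features (illustrative)."""
    def __init__(self, img_dim: int, q_dim: int, hidden_dim: int = 512):
        super().__init__()
        self.img_proj = nn.Linear(img_dim, hidden_dim)
        self.q_proj = nn.Linear(q_dim, hidden_dim)
        self.score = nn.Linear(hidden_dim, 1)

    def forward(self, img_feats, q_feat):
        # img_feats: (batch, regions, img_dim); q_feat: (batch, q_dim)
        joint = torch.tanh(self.img_proj(img_feats) + self.q_proj(q_feat).unsqueeze(1))
        weights = F.softmax(self.score(joint).squeeze(-1), dim=1)  # (batch, regions)
        # Pool regions by relevance instead of attending to every detail.
        attended = (weights.unsqueeze(-1) * img_feats).sum(dim=1)
        return attended, weights
```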
“…Different approaches have been experimented with for achieving each of these steps in the literature. For image feature extraction, deep learning models like VGGNet [1][2][3][4][5][6], ResNet [7][8][9], and even F-RCNN [10][11][12] have proved highly effective. For question feature extraction, the literature has explored word embedding techniques ranging from simple word2vec [13,14] and GloVe [15,16] to more complex LSTMs, GRUs [14,15,17], and transformers [18][19][20].…”
Section: Introduction (mentioning)
confidence: 99%
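
A minimal sketch of the question-featurization pipeline this statement outlines, assuming a word-embedding table summarized by an LSTM; the vocabulary size, dimensions, and class name are hypothetical.

```python
import torch
import torch.nn as nn

class QuestionEncoder(nn.Module):
    """Embed question tokens and summarize them with an LSTM (illustrative)."""
    def __init__(self, vocab_size: int, embed_dim: int = 300, hidden_dim: int = 512):
        super().__init__()
        # The embedding table is often initialized from pretrained
        # word2vec/GloVe vectors; random initialization here for brevity.
        self.embed = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)

    def forward(self, token_ids):
        # token_ids: (batch, seq_len) integer word indices
        _, (h_n, _) = self.lstm(self.embed(token_ids))
        return h_n[-1]  # (batch, hidden_dim) fixed-size question vector
```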
“…Most VQA literature utilizes Convolutional Neural Networks (CNNs) and their variants for image featurization. Early on, many researchers used VGGNet to extract image features [1,2,3,4,5], taking the final hidden layer as the image representation, since most of the spatial information is retained in that layer.…”
Section: Feature Extraction (mentioning)
confidence: 99%
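
A sketch of extracting spatially-aware VGG features as described above, using torchvision's pretrained VGG16 (the `weights=` API assumes torchvision ≥ 0.13; older versions use `pretrained=True`). Which "final hidden layer" the cited works used may differ; taking the last convolutional output is one common reading.

```python
import torch
from torchvision import models

# Keep only VGG16's convolutional stack; its last layer outputs a
# 7x7 grid of 512-d vectors for a 224x224 input, so spatial layout survives.
vgg = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1)
feature_extractor = vgg.features.eval()

with torch.no_grad():
    image = torch.randn(1, 3, 224, 224)        # stand-in for a preprocessed image
    fmap = feature_extractor(image)            # (1, 512, 7, 7)
    regions = fmap.flatten(2).transpose(1, 2)  # (1, 49, 512): one vector per region
```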
“…(3) We critically analyze state-of-the-art (SoTA), end-to-end VQA models, their limitations, and future improvements. (4) Finally, we provide guidelines and future directions for further improvements in VQA models.…”
Section: Introduction (mentioning)
confidence: 99%
“…The model was evaluated on the VQA 1.0 dataset and achieved an accuracy of 62.1%. A similar hierarchical approach is used in [25] with a novel multi-step reasoning and adaptive fusion method. Here, the model performs textual and visual attention in two steps.…”
Section: Related Work (mentioning)
confidence: 99%
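
The statement above describes two-step attention combined with an adaptive fusion method. The sketch below is a hypothetical reconstruction of that idea, not the cited paper's architecture: two attention glimpses over the image, with a learned sigmoid gate adaptively fusing them. All names and the gating form are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoStepAttentionFusion(nn.Module):
    """Hypothetical sketch: two rounds of question-guided visual attention,
    with a learned gate adaptively fusing the two glimpses."""
    def __init__(self, dim: int = 512):
        super().__init__()
        self.att1 = nn.Linear(dim * 2, 1)
        self.att2 = nn.Linear(dim * 2, 1)
        self.gate = nn.Linear(dim * 2, 1)

    def attend(self, scorer, img, query):
        # img: (batch, regions, dim); query: (batch, dim)
        q = query.unsqueeze(1).expand(-1, img.size(1), -1)
        w = F.softmax(scorer(torch.cat([img, q], dim=-1)).squeeze(-1), dim=1)
        return (w.unsqueeze(-1) * img).sum(dim=1)

    def forward(self, img_feats, q_feat):
        v1 = self.attend(self.att1, img_feats, q_feat)       # first glimpse
        v2 = self.attend(self.att2, img_feats, q_feat + v1)  # refined glimpse
        g = torch.sigmoid(self.gate(torch.cat([v1, v2], dim=-1)))
        return g * v1 + (1 - g) * v2                         # adaptive fusion
```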