2019
DOI: 10.1007/978-3-030-36802-9_24
Intra-Modality Feature Interaction Using Self-attention for Visual Question Answering

Cited by 4 publications (2 citation statements)
References 16 publications
“…Furthermore, some approaches [49], [57], [58], [59], [60] incorporated object detectors to enhance the extraction of visual features. Attention mechanisms [49], [61], [62] and graph neural networks [44], [63] have gained increasing popularity in recent years due to their ability to capture fine-grained visual details and contextual relationships. Notably, methods incorporating VLP models, such as ConZIC [64], have demonstrated efficient performance in image captioning.…”
Section: A. Image Captioning
confidence: 99%
“…Researchers have successfully employed visual attention methods in the VQA task over the past few years. These traditional attention methods have been extended in different directions, such as region-based attention [1][2][3], object-based attention [4][5][6], and semantic concept-based attention [7][8][9]. However, all of these methods rely on particular visual features, such as region, object, or semantic concept features, to construct the attention mechanism, which is insufficient for feature representation.…”
Section: Introduction
confidence: 99%
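The cited paper's title refers to intra-modality feature interaction via self-attention, i.e. letting visual features attend to one another within the image modality rather than relying only on a single type of region, object, or concept feature. The sketch below is an illustrative, minimal version of that idea, not the authors' exact architecture: a scaled dot-product self-attention layer over a set of region features, where the class name IntraModalitySelfAttention, the feature dimensions, and the 36-region Faster R-CNN setup are assumptions made for this example.

```python
# Minimal sketch (assumption: not the paper's exact implementation) of
# intra-modality self-attention over visual region features for VQA.
import torch
import torch.nn as nn
import torch.nn.functional as F

class IntraModalitySelfAttention(nn.Module):
    """Scaled dot-product self-attention within one modality (e.g. image regions)."""

    def __init__(self, feature_dim: int, hidden_dim: int):
        super().__init__()
        # Queries, keys, and values are all derived from the same feature set,
        # so every region can attend to every other region of the same image.
        self.query = nn.Linear(feature_dim, hidden_dim)
        self.key = nn.Linear(feature_dim, hidden_dim)
        self.value = nn.Linear(feature_dim, hidden_dim)
        self.scale = hidden_dim ** 0.5

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        # features: (batch, num_regions, feature_dim)
        q, k, v = self.query(features), self.key(features), self.value(features)
        # Pairwise attention weights between regions: (batch, regions, regions)
        attn = F.softmax(q @ k.transpose(-2, -1) / self.scale, dim=-1)
        # Each region's output is a weighted sum over all regions' values.
        return attn @ v

# Example: 36 region features of dimension 2048, a common VQA setup
# for Faster R-CNN bottom-up features (hypothetical inputs here).
regions = torch.randn(2, 36, 2048)
attended = IntraModalitySelfAttention(2048, 512)(regions)
print(attended.shape)  # torch.Size([2, 36, 512])
```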