Multimodal Compact Bilinear Pooling for Visual Question Answering and Visual Grounding
2016 | Preprint
DOI: 10.48550/arxiv.1606.01847

Cited by 287 publications (426 citation statements: 5 supporting, 421 mentioning, 0 contrasting)
References 30 publications
“…The main issue in the bilinear operation is the high dimensionality of its output with respect to the cardinality of the inputs. Recently, compact bilinear pooling has been proposed to overcome this shortcoming [13], [14], [15], [21]. This pooling algorithm closely approximates bilinear pooling while keeping the dimensionality of the embedding space relatively small.…”
Section: Feature Extraction and Fusion (mentioning)
confidence: 99%
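The compression these statements refer to can be made concrete. Below is a minimal NumPy sketch of compact bilinear pooling in the count-sketch/FFT style of the paper above: each input is count-sketched into d dimensions, and the two sketches are circularly convolved in the frequency domain, which approximates the flattened outer product. Function names and the output dimension d=1024 are illustrative assumptions; in practice a much larger d is used, and the random hashes must be sampled once and reused across samples.

```python
import numpy as np

rng = np.random.default_rng(0)

def sketch_params(n, d):
    """One random hash h: [n] -> [d] and random signs s in {-1, +1},
    drawn once and then fixed for a given input modality."""
    return rng.integers(0, d, size=n), rng.choice([-1.0, 1.0], size=n)

def count_sketch(x, h, s, d):
    """Project x (length n) into d dims: y[h[i]] += s[i] * x[i]."""
    y = np.zeros(d)
    np.add.at(y, h, s * x)
    return y

def compact_bilinear(v, q, d=1024):
    """Approximate the flattened outer product of v and q in d dims:
    the count sketch of an outer product equals the circular
    convolution of the two count sketches, computed here via FFT."""
    hv, sv = sketch_params(v.size, d)
    hq, sq = sketch_params(q.size, d)
    fv = np.fft.rfft(count_sketch(v, hv, sv, d))
    fq = np.fft.rfft(count_sketch(q, hq, sq, d))
    return np.fft.irfft(fv * fq, n=d)

# Fuse a 2048-dim visual feature with a 300-dim text feature into
# 1024 dims instead of the 2048 * 300 = 614,400 dims of full
# bilinear pooling.
fused = compact_bilinear(rng.standard_normal(2048), rng.standard_normal(300))
print(fused.shape)  # (1024,)
```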
“…The most commonly deployed feature fusion methods for multimodal frameworks in the literature are feature concatenation [9], [10], bilinear multiplication [11], [12], and compact bilinear pooling [13], [14], [15]. However, these methods treat all samples equally and do not take their reliability and usefulness into account.…”
Section: Introduction (mentioning)
confidence: 99%
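To make the trade-off among the listed fusion methods concrete, here is a small NumPy comparison of the output sizes of concatenation and bilinear multiplication; the feature sizes 2048 and 300 are illustrative assumptions, not values from the cited papers.

```python
import numpy as np

rng = np.random.default_rng(0)
v = rng.standard_normal(2048)  # e.g. a CNN image feature
q = rng.standard_normal(300)   # e.g. a question embedding

# Concatenation: output size is the sum of the input sizes.
concat = np.concatenate([v, q])    # (2348,)

# Bilinear multiplication (outer product): captures all pairwise
# interactions, but the output size is the *product* of the input
# sizes, the high dimensionality that compact bilinear pooling
# compresses.
bilinear = np.outer(v, q).ravel()  # (614400,)

print(concat.shape, bilinear.shape)
```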
“…For VQA tasks, the main multimodal challenges are how to represent the visual and language modalities and how to fuse them in order to perform the Question Answering (QA) task. To represent the questions, word embeddings such as GloVe [17] are commonly used in conjunction with recurrent neural networks (RNNs) such as Long Short-Term Memory (LSTM) networks [6], for example by Fukui et al. [4]. To represent the visual modality, grid-based Convolutional Neural Networks (CNNs) such as ResNet [9] are often used as visual feature extractors.…”
Section: VQA (mentioning)
confidence: 99%
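A minimal PyTorch sketch of the two encoders this statement describes: an embedding layer feeding an LSTM for the question, and a ResNet grid for the image. All sizes here are illustrative assumptions and the weights are random; in practice the embedding would be initialized from GloVe vectors and the CNN from pretrained ResNet weights.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet152

# Question encoder: an embedding layer (GloVe-initialized in practice)
# feeding an LSTM; the final hidden state summarizes the question.
vocab_size, emb_dim, hid_dim = 10000, 300, 1024   # illustrative sizes
embed = nn.Embedding(vocab_size, emb_dim)
lstm = nn.LSTM(emb_dim, hid_dim, batch_first=True)

tokens = torch.randint(0, vocab_size, (1, 12))    # one 12-token question
_, (h_n, _) = lstm(embed(tokens))
q_feat = h_n[-1]                                  # (1, 1024)

# Visual encoder: a grid-based CNN such as ResNet with the classifier
# head removed, keeping the last convolutional feature map as a
# spatial grid of visual features.
cnn = resnet152(weights=None)                     # load pretrained weights in practice
grid = nn.Sequential(*list(cnn.children())[:-2])  # drop avgpool and fc
img = torch.randn(1, 3, 448, 448)
v_feat = grid(img)                                # (1, 2048, 14, 14)
```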
“…Visual Grounding: Visual grounding models encourage caption generators to link phrases with specific spatial regions of images or videos, thereby offering a way to improve the explainability of models [7,24,35,46,49,53]. The most common approach is to predict the next word using an attention mechanism deployed over noun phrases, with supervised bounding boxes as input.…”
Section: Related Work (mentioning)
confidence: 99%
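A sketch of the attention step the statement describes: additive attention of a decoder state over region features (e.g. extracted from supervised bounding boxes). All module names and dimensions are assumptions for illustration; the region with the highest attention weight is the one the generated word is grounded in.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical sizes: R region features from supervised bounding
# boxes, and a decoder hidden state h_t at the current word.
R, reg_dim, hid_dim, vocab = 36, 2048, 512, 10000
regions = torch.randn(1, R, reg_dim)
h_t = torch.randn(1, hid_dim)

proj_r = nn.Linear(reg_dim, hid_dim)   # project region features
proj_h = nn.Linear(hid_dim, hid_dim)   # project decoder state
score = nn.Linear(hid_dim, 1)          # scalar score per region
out = nn.Linear(hid_dim + reg_dim, vocab)

# Additive attention: score each region against the decoder state.
e = score(torch.tanh(proj_r(regions) + proj_h(h_t).unsqueeze(1)))  # (1, R, 1)
alpha = F.softmax(e, dim=1)                                        # weights over regions
context = (alpha * regions).sum(dim=1)                             # (1, reg_dim)

# Predict the next word from the state plus the attended region; the
# argmax of alpha links (grounds) that word to a spatial region.
logits = out(torch.cat([h_t, context], dim=-1))                    # (1, vocab)
```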
“…7 on SPICE because our method learns the refined representation of the SG, which provides relational knowledge and a positional semantic prior, to improve this score. It is noteworthy that RGL (w/o OG) achieves almost all of the best captioning scores; this is reasonable because, without the grounding operation, the captioning model may pay more attention to description generation.…”
(mentioning)
confidence: 99%