Multimodal Compact Bilinear Pooling for Visual Question Answering and Visual Grounding

Fukui, Akira; Park, Dong Huk; Yang, Daylen; Rohrbach, Anna; Darrell, Trevor; Rohrbach, Marcus

doi:10.48550/arxiv.1606.01847

Cited by 287 publications

(426 citation statements)

References 30 publications

Supporting

Mentioning: 421

Contrasting


“…The main issue in bilinear operation is the high dimensionality of its output regarding the cardinality of the inputs. Recently, to overcome this shortcoming, compact bilinear pooling has been proposed [13], [14], [15], [21]. This pooling algorithm mimics results close to bilinear pooling while the dimensionality of the embedding space is relatively small.…”

Section: Feature Extraction and Fusion

mentioning

confidence: 99%
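The claim in the statement above, that compact bilinear pooling approximates the bilinear (outer-product) interaction in a much smaller embedding space, can be illustrated with a minimal NumPy sketch. It assumes the usual Count Sketch construction, in which the sketch of an outer product equals the circular convolution of the two individual sketches and is computed with an FFT. The function names (`make_sketch_params`, `count_sketch`, `compact_bilinear`) and the toy dimensions are hypothetical choices for illustration and are not taken from any of the cited papers' code.

```python
import numpy as np


def make_sketch_params(input_dim, output_dim, rng):
    """Draw the fixed random hash indices and signs of one Count Sketch (illustrative helper)."""
    h = rng.integers(0, output_dim, size=input_dim)        # bucket index per input coordinate
    s = rng.choice(np.array([-1.0, 1.0]), size=input_dim)  # random sign per input coordinate
    return h, s


def count_sketch(x, h, s, output_dim):
    """Project x into output_dim buckets: bucket h[i] accumulates s[i] * x[i]."""
    y = np.zeros(output_dim)
    np.add.at(y, h, s * x)  # unbuffered add, so colliding indices are summed
    return y


def compact_bilinear(x, y, params_x, params_y, output_dim):
    """Sketch the outer product of x and y without materializing it.

    Element-wise multiplication in the frequency domain is a circular
    convolution of the two sketches in the original domain.
    """
    sx = count_sketch(x, *params_x, output_dim)
    sy = count_sketch(y, *params_y, output_dim)
    return np.fft.irfft(np.fft.rfft(sx) * np.fft.rfft(sy), n=output_dim)


rng = np.random.default_rng(0)
n, m, d = 2048, 2048, 1024  # assumed toy sizes: the full outer product would have n * m entries
params_x = make_sketch_params(n, d, rng)
params_y = make_sketch_params(m, d, rng)
fused = compact_bilinear(rng.standard_normal(n), rng.standard_normal(m), params_x, params_y, d)
print(fused.shape, "instead of", (n * m,))
```

With these assumed sizes the fused vector has `d` entries rather than `n * m`, which is the dimensionality saving the statement refers to. The hash indices and signs are drawn once and then held fixed, so the same projection is applied to every sample.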


Quality-Aware Multimodal Biometric Recognition

Soleymani, Dabouei, Taherkhani et al. 2021

Preprint


We present a quality-aware multimodal recognition framework that combines representations from multiple biometric traits with varying quality and number of samples to achieve increased recognition accuracy by extracting complementary identification information based on the quality of the samples. We develop a quality-aware framework for fusing representations of input modalities by weighting their importance using quality scores estimated in a weakly-supervised fashion. This framework utilizes two fusion blocks, each represented by a set of quality-aware and aggregation networks. In addition to architecture modifications, we propose two task-specific loss functions: multimodal separability loss and multimodal compactness loss. The first loss assures that the representations of modalities for a class have comparable magnitudes to provide a better quality estimation, while the multimodal representations of different classes are distributed to achieve maximum discrimination in the embedding space. The second loss, which regularizes the network weights, improves generalization performance. We evaluate the performance by considering three multimodal datasets consisting of face, iris, and fingerprint modalities. The efficacy of the framework is demonstrated through comparison with the state-of-the-art algorithms. In particular, our framework outperforms the rank- and score-level fusion of modalities of BIOMDATA [1] by more than 30% for true acceptance rate at a false acceptance rate of 10⁻⁴.


“…The most commonly deployed feature fusion methods for multimodal frameworks presented in the literature are feature concatenation [9], [10], bilinear multiplication [11], [12], and compact bilinear pooling [13], [14], [15]. However, these methods treat all samples equally, and do not take their reliability and usefulness into account.…”

Section: Introduction

mentioning

confidence: 99%

Quality-Aware Multimodal Biometric Recognition

Soleymani, Dabouei, Taherkhani et al. 2021

Preprint

“…For VQA tasks, the main multimodal challenges are how to represent the visual and language modalities and how to fuse them in order to perform the Question Answering (QA) task. In terms of representing the questions, word embeddings such as GloVe [17] are commonly used in conjunction with recurrent neural networks (RNNs) such as Long Short-Term Memory (LSTM) networks [6], for example by Fukui et al. [4]. For representing the visual modality, grid-based Convolutional Neural Networks (CNNs) such as ResNet [9] are often used as visual feature extractors.…”

Section: VQA

mentioning

confidence: 99%

Graph Relation Transformer: Incorporating pairwise object features into the Transformer architecture

Yang, Zachary, Clive 2021

Preprint


Previous studies such as VizWiz find that Visual Question Answering (VQA) systems that can read and reason about text in images are useful in application areas such as assisting visually-impaired people. TextVQA is a VQA dataset geared towards this problem, where the questions require answering systems to read and reason about visual objects and text objects in images. One key challenge in TextVQA is the design of a system that effectively reasons not only about visual and text objects individually, but also about the spatial relationships between these objects. This motivates the use of 'edge features', that is, information about the relationship between each pair of objects. Some current TextVQA models address this problem but either only use categories of relations (rather than edge feature vectors) or do not use edge features within the Transformer architectures. In order to overcome these shortcomings, we propose a Graph Relation Transformer (GRT), which uses edge information in addition to node information for graph attention computation in the Transformer. We find that, without using any other optimizations, the proposed GRT method outperforms the accuracy of the M4C baseline model by 0.65% on the val set and 0.57% on the test set. Qualitatively, we observe that the GRT has superior spatial reasoning ability to M4C.


“…Visual Grounding: Visual grounding models encourage captioning generators to link phrases with specific spatial regions of images or videos, thereby presenting a potential way to improve the explainability of models [7,24,35,46,49,53]. The most common way of grounding models is to predict the next word using an attention mechanism, which is deployed over noun phrases, with supervised bounding boxes as input.…”

Section: Related Work

mentioning

confidence: 99%

“…7 on SPICE because our method learns the refined representation of SG, which provides relational knowledge and positional semantic prior, to improve this score. It is noteworthy that the RGL (w/o OG) achieves almost all the best captioning scores, this is reasonable, because without grounding operation, the captioning model may pay more attention to the description generation.…”

mentioning

confidence: 99%

Relational Graph Learning for Grounded Video Description Generation

Zhang, Wang, Tang et al. 2020

Proceedings of the 28th ACM International Conference on Multimedia


Grounded video description (GVD) encourages captioning models to attend to appropriate video regions (e.g., objects) dynamically and generate a description. Such a setting can help explain the decisions of captioning models and prevents the model from hallucinating object words in its description. However, such design mainly focuses on object word generation and thus may ignore fine-grained information and suffer from missing visual concepts. Moreover, relational words (e.g., "jump left or right") are usually spatio-temporal inference results, i.e., these words cannot be grounded on certain spatial regions. To tackle the above limitations, we design a novel relational graph learning framework for GVD, in which a language-refined scene graph representation is designed to explore fine-grained visual concepts. Furthermore, the refined graph can be regarded as relational inductive knowledge to assist captioning models in selecting the relevant information they need to generate correct words. We validate the effectiveness of our model through automatic metrics and human evaluation, and the results indicate that our approach can generate more fine-grained and accurate descriptions, and it solves the problem of object hallucination to some extent. CCS CONCEPTS • Computing methodologies → Scene understanding.
