2019 IEEE/CVF International Conference on Computer Vision (ICCV)
DOI: 10.1109/iccv.2019.00048
Compact Trilinear Interaction for Visual Question Answering

Abstract: In Visual Question Answering (VQA), answers are strongly correlated with both the question meaning and the visual content. Thus, to selectively utilize image, question, and answer information, we propose a novel trilinear interaction model which simultaneously learns high-level associations between these three inputs. In addition, to overcome the interaction complexity, we introduce a multimodal tensor-based PARALIND decomposition which efficiently parameterizes the trilinear interaction between the three inputs. Moreover, kn…
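To see why a full (unfactorized) trilinear interaction is impractical, note that producing a joint vector z from question, image, and answer features requires a dense fourth-order tensor whose size is the product of all four dimensions. The minimal sketch below makes this concrete; the toy sizes are our assumption so the snippet actually runs, and the einsum is the textbook trilinear form rather than the paper's decomposition:

```python
import torch

# Unfactorized trilinear interaction: a joint vector z needs a dense
# 4th-order tensor T of shape (d_q, d_v, d_a, d_z). At realistic sizes
# (e.g. 768 x 2048 x 768 x 512) T would hold roughly 6e11 parameters,
# which is what CTI's PARALIND decomposition avoids materializing.
d_q, d_v, d_a, d_z = 8, 16, 8, 4                # toy dimensions (assumption)
q, v, a = torch.randn(d_q), torch.randn(d_v), torch.randn(d_a)
T = torch.randn(d_q, d_v, d_a, d_z)             # the full interaction tensor
z = torch.einsum('i,j,k,ijkl->l', q, v, a, T)   # z_l = sum_ijk T[i,j,k,l] q_i v_j a_k
print(z.shape)  # torch.Size([4])
```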

Cited by 62 publications (46 citation statements) · References 35 publications
“…The aggregator needs to detect high-level interactions between the three streams to provide a meaningful answer, without erasing the lower-level interactions extracted in the previous steps. We design the aggregator by applying the Compact Trilinear Interaction (CTI) (Do et al., 2019) to question, answer, and image features, generating a vector that jointly represents the three features.…”
Section: Concept-vision-language Embedding Module
confidence: 99%
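To make the quoted aggregation concrete, here is a minimal rank-R sketch in PyTorch of a CTI-style interaction: each modality is projected into R slices of the output size, the slices are fused by an elementwise (Hadamard) product, and summing over slices yields the joint vector. All module and dimension names are our assumptions; the actual CTI uses a PARALIND parameterization with attention over regions and words, which this sketch omits.

```python
import torch
import torch.nn as nn

class FactorizedTrilinear(nn.Module):
    """Rank-constrained trilinear interaction over three feature vectors.

    A simplified stand-in for CTI's PARALIND-parameterized interaction:
    R rank-1 "slices" per modality, fused by a Hadamard product and
    summed. An illustrative sketch, not the authors' implementation.
    """

    def __init__(self, d_q, d_v, d_a, d_z, R=32):
        super().__init__()
        self.R = R
        self.proj_q = nn.Linear(d_q, R * d_z)
        self.proj_v = nn.Linear(d_v, R * d_z)
        self.proj_a = nn.Linear(d_a, R * d_z)

    def forward(self, q, v, a):
        B = q.size(0)
        # Project each modality into R slices of size d_z.
        hq = self.proj_q(q).view(B, self.R, -1)
        hv = self.proj_v(v).view(B, self.R, -1)
        ha = self.proj_a(a).view(B, self.R, -1)
        # Hadamard product fuses the three modalities slice-wise;
        # the sum over slices is the joint representation z.
        return (hq * hv * ha).sum(dim=1)

# Example: fuse 768-d question, 2048-d image, and 768-d answer features
# (dimensions are assumptions) into one 512-d joint vector.
fuse = FactorizedTrilinear(d_q=768, d_v=2048, d_a=768, d_z=512, R=32)
z = fuse(torch.randn(4, 768), torch.randn(4, 2048), torch.randn(4, 768))
print(z.shape)  # torch.Size([4, 512])
```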
“…However, it leads to an increase in the computational cost. We set R = 32 in Equation 5, the same value as in the CTI (Do et al., 2019), for the slicing parameter. … optimizer with an initial learning rate of 4e-5.…”
Section: Concept-vision-language Embedding
confidence: 99%
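A small configuration sketch matching the quoted hyperparameters; the optimizer family is an assumption, since the excerpt is truncated before its name, and the model is a placeholder:

```python
import torch
import torch.nn as nn

# Placeholder model: the excerpt pins down only R = 32 fusion slices and
# an initial learning rate of 4e-5. Adam is an assumption, as the quoted
# sentence is cut off before the optimizer's name.
model = nn.Linear(512, 512)
optimizer = torch.optim.Adam(model.parameters(), lr=4e-5)
```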
“…Our GATs include 16 attention heads. In the fusion model, R = 32 as suggested in (Do et al., 2019) and d_z = 512, since this leads to the best results in our model.…”
Section: Methods
confidence: 86%
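A hedged sketch of the quoted configuration using PyTorch Geometric's GATConv: 16 attention heads whose concatenated output matches d_z = 512. The node feature size and the toy graph are assumptions:

```python
import torch
from torch_geometric.nn import GATConv

# One GAT layer with 16 heads; with concat=True (the default) the output
# dimension is heads * out_channels = 16 * 32 = 512 = d_z.
gat = GATConv(in_channels=512, out_channels=512 // 16, heads=16)
x = torch.randn(5, 512)                                  # 5 node features
edge_index = torch.tensor([[0, 1, 2, 3], [1, 2, 3, 4]])  # toy chain graph
out = gat(x, edge_index)
print(out.shape)  # torch.Size([5, 512])
```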
“…Popular fusion methods such as BAN (Kim et al., 2018) or MUTAN (Ben-younes et al., 2017) are not suitable for our work, since we have three types of features to fuse. Therefore, we design a fusion method by applying the Compact Trilinear Interaction (CTI) (Do et al., 2019) to the question embeddings, scene graph visual features, and concept features, generating a vector that jointly represents the three features.…”
Section: Multimodal Fusion
confidence: 99%
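For the quoted triple (question embeddings, scene-graph visual features, concept features), the same rank-R machinery applies; below is a compact einsum form of the slice-sum trilinear fusion, with sizes and weight names as illustrative assumptions:

```python
import torch

# Rank-R trilinear fusion of three 512-d inputs into one 512-d vector.
# Each U* holds R (d x d) slices; per slice the three projections are
# multiplied elementwise, then slices are summed (the same pattern as
# the module sketched earlier, written as a single einsum).
R, d = 32, 512
q, v, c = torch.randn(d), torch.randn(d), torch.randn(d)  # question / scene graph / concepts
Uq, Uv, Uc = (torch.randn(R, d, d) * d ** -0.5 for _ in range(3))
z = torch.einsum('rij,rkj,rlj,i,k,l->j', Uq, Uv, Uc, q, v, c)
print(z.shape)  # torch.Size([512])
```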
“…As encoder-decoder based frameworks are widely used for sequence machine learning, we focus here on multi-modal fusion at the encoder stage or the decoder stage. Generally, fusing features in the encoder achieves better performance than fusing them in the decoder [38,39], since information from different modalities can interact earlier. However, an optimal mapping between modalities is usually unavailable, which makes encoder fusion challenging.…”
Section: Multi-modal Machine Learning
confidence: 99%
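A toy contrast between the two placements the excerpt compares; the GRU encoders, sizes, and aligned sequence inputs are all assumptions of this sketch, and the point is only where the modalities meet:

```python
import torch
import torch.nn as nn

d = 256  # illustrative feature size

class EncoderFusion(nn.Module):
    """Encoder-stage fusion: modalities are concatenated before encoding,
    so cross-modal information interacts early, inside one encoder."""
    def __init__(self):
        super().__init__()
        self.enc = nn.GRU(2 * d, d, batch_first=True)

    def forward(self, xa, xb):
        out, _ = self.enc(torch.cat([xa, xb], dim=-1))
        return out

class DecoderFusion(nn.Module):
    """Decoder-stage fusion: each modality gets its own encoder and the
    streams only meet at the output, so interaction happens late."""
    def __init__(self):
        super().__init__()
        self.enc_a = nn.GRU(d, d, batch_first=True)
        self.enc_b = nn.GRU(d, d, batch_first=True)

    def forward(self, xa, xb):
        ha, _ = self.enc_a(xa)
        hb, _ = self.enc_b(xb)
        return torch.cat([ha, hb], dim=-1)

xa, xb = torch.randn(2, 7, d), torch.randn(2, 7, d)  # two aligned streams
print(EncoderFusion()(xa, xb).shape)  # torch.Size([2, 7, 256])
print(DecoderFusion()(xa, xb).shape)  # torch.Size([2, 7, 512])
```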