2016
DOI: 10.48550/arxiv.1610.04325
Preprint

Hadamard Product for Low-rank Bilinear Pooling

Jin-Hwa Kim,
Kyoung-Woon On,
Woosang Lim
et al.

Abstract: Bilinear models provide rich representations compared with linear models. They have been applied in various visual tasks, such as object recognition, segmentation, and visual question-answering, to get state-of-the-art performances taking advantage of the expanded representations. However, bilinear representations tend to be high-dimensional, limiting the applicability to computationally complex tasks. We propose low-rank bilinear pooling using Hadamard product for an efficient attention mechanism of multimodal…
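The pooling described in the abstract replaces a full bilinear interaction with two low-rank projections joined by a Hadamard (element-wise) product. Below is a minimal NumPy sketch of that factorization; the dimensions, the tanh nonlinearities, and the random weights are illustrative assumptions, not the paper's exact configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes (assumptions, not the paper's experimental setup).
N, M = 2048, 300    # e.g. visual-feature and text-embedding dimensions
d, c = 512, 1000    # low-rank joint dimension and pooled output dimension

# Learnable projections in a real model; random placeholders here.
U = rng.standard_normal((N, d)) * 0.01   # projects x into the joint space
V = rng.standard_normal((M, d)) * 0.01   # projects y into the joint space
P = rng.standard_normal((d, c)) * 0.01   # pools the joint space to c outputs

def low_rank_bilinear_pool(x, y):
    """f = P^T (tanh(U^T x) * tanh(V^T y)).

    A full bilinear form x^T W_i y needs an N x M matrix W_i per output i;
    factoring W_i ~ U_i V_i^T turns the interaction into a d-dimensional
    Hadamard (element-wise) product, which keeps the pooling low-rank."""
    joint = np.tanh(U.T @ x) * np.tanh(V.T @ y)   # (d,) Hadamard product
    return P.T @ joint                            # (c,) pooled feature

x = rng.standard_normal(N)   # stand-in visual feature
y = rng.standard_normal(M)   # stand-in question feature
print(low_rank_bilinear_pool(x, y).shape)   # (1000,)
```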

Cited by 95 publications (147 citation statements)
References 17 publications
“…Although multi-modal attention has shown great promise in text-image summarization tasks, it itself is not sufficient for text-video-audio summarization tasks [61]. Hence, to overcome this weakness, Fu et al [29] proposed bi-hop attention as an extension of bi-linear attention [50], and Li et al [61] developed a novel conditional self-attention mechanism module to capture local semantic information of video conditioned on the input text information. Both of these techniques were backed empirically, and established state-of-the-art in their respective problems.…”
Section: Neural Models
confidence: 99%
“…Since the dot-product correlation contracts all channel dimensions of the query and the keys, we may lose semantic information, which may help in generating an effective relational kernel. We thus take the Hadamard product [23] instead so that we can leverage channel-wise query-key correlations for producing the relational kernel. Using a learnable kernel projection matrix H ∈ R M C×M , Eq.…”
Section: Our Approach
confidence: 99%
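The quoted passage contrasts dot-product correlation, which collapses the channel axis into one scalar per key, with Hadamard-product correlation, which keeps a per-channel value for every query-key pair. A minimal NumPy sketch of that distinction follows; the shapes and the random matrix standing in for the learnable kernel projection H are illustrative assumptions, not the cited model's implementation.

```python
import numpy as np

rng = np.random.default_rng(1)

C, M = 64, 9                        # channels and number of keys (illustrative)
q = rng.standard_normal(C)          # query feature
K = rng.standard_normal((M, C))     # M key features

# Dot-product correlation: contracts the channel axis, leaving one
# scalar per key, so channel-wise semantics are collapsed.
dot_corr = K @ q                    # shape (M,)

# Hadamard-product correlation: one value per key *and* per channel,
# preserving channel-wise query-key information.
had_corr = K * q                    # shape (M, C), broadcast over keys

# Learnable projection (random placeholder) mapping the M*C channel-wise
# correlations to an M-dimensional relational kernel, mirroring the
# quoted H in R^(MC x M).
H = rng.standard_normal((M * C, M)) * 0.01
kernel = H.T @ had_corr.reshape(-1) # shape (M,)

print(dot_corr.shape, had_corr.shape, kernel.shape)
```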
“…After the encoding stage, image features and sequence features are first fused and then fed into the image decoder and the sequence decoder. Representative methods for multimodal feature fusion include Concat+MLP, MCB [Fukui et al 2016], MLB [Kim et al 2016] and MFB [Yu et al 2017]. Concat+MLP (Concatenation + Multi-Layer Perceptron), one of the most intuitive strategies, is adopted in our model, which is proven to be effective through our experiments.…”
Section: Multi-modality Representation Learning
confidence: 99%
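For comparison, here is a minimal NumPy sketch of the two fusion strategies named in the quote, Concat+MLP and MLB-style Hadamard fusion; the feature sizes, nonlinearities, and random weights are illustrative assumptions rather than any cited model's configuration.

```python
import numpy as np

rng = np.random.default_rng(2)

d_img, d_seq, d_joint = 512, 256, 512   # illustrative feature sizes

img_feat = rng.standard_normal(d_img)   # stand-in image feature
seq_feat = rng.standard_normal(d_seq)   # stand-in sequence feature

# Concat + MLP: concatenate the two modalities, then a small MLP.
W1 = rng.standard_normal((d_img + d_seq, d_joint)) * 0.01
W2 = rng.standard_normal((d_joint, d_joint)) * 0.01
concat_mlp = np.tanh(W2.T @ np.tanh(W1.T @ np.concatenate([img_feat, seq_feat])))

# MLB-style fusion: project each modality into a common space and
# combine with a Hadamard (element-wise) product, as in the paper above.
Wi = rng.standard_normal((d_img, d_joint)) * 0.01
Ws = rng.standard_normal((d_seq, d_joint)) * 0.01
mlb = np.tanh(Wi.T @ img_feat) * np.tanh(Ws.T @ seq_feat)

print(concat_mlp.shape, mlb.shape)   # (512,) (512,)
```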