Hadamard Product for Low-rank Bilinear Pooling

Kim, Jin-Hwa; On, Kyoung-Woon; Lim, Woosang; Kim, Jeonghee; Ha, Jung-Woo; Zhang, Byoung-Tak

doi:10.48550/arxiv.1610.04325

Cited by 95 publications

(147 citation statements)

References 17 publications

Supporting

Mentioning

146

Contrasting


“…Although multi-modal attention has shown great promise in text-image summarization tasks, it itself is not sufficient for text-video-audio summarization tasks [61]. Hence, to overcome this weakness, Fu et al [29] proposed bi-hop attention as an extension of bi-linear attention [50], and Li et al [61] developed a novel conditional self-attention mechanism module to capture local semantic information of video conditioned on the input text information. Both of these techniques were backed empirically, and established state-of-the-art in their respective problems.…”

Section: Neural Models (mentioning)

confidence: 99%

A Survey on Multi-modal Summarization

Jangra, Mukherjee, Jatowt et al. 2021

Preprint


The new era of technology has brought us to the point where it is convenient for people to share their opinions over an abundance of platforms. These platforms have a provision for the users to express themselves in multiple forms of representations, including text, images, videos, and audio. This, however, makes it difficult for users to obtain all the key information about a topic, making the task of automatic multi-modal summarization (MMS) essential. In this paper, we present a comprehensive survey of the existing research in the area of MMS.



“…Since the dot-product correlation contracts all channel dimensions of the query and the keys, we may lose semantic information, which may help in generating an effective relational kernel. We thus take the Hadamard product [23] instead so that we can leverage channel-wise query-key correlations for producing the relational kernel. Using a learnable kernel projection matrix H ∈ ℝ^{MC×M}, Eq.…”

Section: Our Approach (mentioning)

confidence: 99%

Relational Self-Attention: What's Missing in Attention for Video Understanding

Kim, Kwon, Wang et al. 2021

Preprint


Convolution has been arguably the most important feature transform for modern neural networks, leading to the advance of deep learning. Recent emergence of Transformer networks, which replace convolution layers with self-attention blocks, has revealed the limitation of stationary convolution kernels and opened the door to the era of dynamic feature transforms. The existing dynamic transforms, including self-attention, however, are all limited for video understanding where correspondence relations in space and time, i.e., motion information, are crucial for effective representation. In this work, we introduce a relational feature transform, dubbed the relational self-attention (RSA), that leverages rich structures of spatio-temporal relations in videos by dynamically generating relational kernels and aggregating relational contexts. Our experiments and ablation studies show that the RSA network substantially outperforms convolution and self-attention counterparts, achieving the state of the art on the standard motion-centric benchmarks for video action recognition, such as Something-Something-V1&V2, Diving48, and FineGym.
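The citation statement for this entry contrasts a dot-product query-key correlation, which contracts all channels into one scalar, with a Hadamard (element-wise) product, which keeps one correlation value per channel. The following is a minimal sketch of that difference only. The function names and the toy vectors are assumptions made for illustration and are not taken from the citing paper.

```python
# Illustrative sketch: names and values are hypothetical, not from the citing paper.

def dot_correlation(query, key):
    """Contract all channels into a single scalar similarity."""
    return sum(q * k for q, k in zip(query, key))


def hadamard_correlation(query, key):
    """Keep one correlation value per channel (element-wise product)."""
    return [q * k for q, k in zip(query, key)]


query = [1.0, 2.0, -1.0]
key_a = [2.0, 0.0, 1.0]
key_b = [0.0, 1.0, 1.0]

# Both keys yield the same scalar score, so the dot product cannot tell them apart.
print(dot_correlation(query, key_a), dot_correlation(query, key_b))

# The channel-wise products differ, so the per-channel information is preserved.
print(hadamard_correlation(query, key_a))
print(hadamard_correlation(query, key_b))
```

Summing the channel-wise product recovers the dot product, so the Hadamard form can be read as the dot product with its final reduction left out.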


“…After the encoding stage, image features and sequence features are first fused and then fed into the image decoder and the sequence decoder. Representative methods for multimodal feature fusion include Concat+MLP, MCB [Fukui et al 2016], MLB [Kim et al 2016] and MFB [Yu et al 2017]. Concat+MLP (Concatenation + Multi-Layer Perceptron), one of the most intuitive strategies, is adopted in our model, which is proven to be effective through our experiments.…”

Section: Multi-modality Representation Learning (mentioning)

confidence: 99%

DeepVecFont: Synthesizing High-quality Vector Fonts via Dual-modality Learning

Wang, Lian 2021

Preprint


… features of fonts to synthesize vector glyphs. Second, we provide a new generative paradigm to handle unstructured data (e.g., vector glyphs) by randomly sampling plausible synthesis results to get the optimal one which is further refined under the guidance of generated structured data (e.g., glyph images). Finally, qualitative and quantitative experiments conducted on a publicly-available dataset demonstrate that our method obtains high-quality synthesis results in the applications of vector font generation and interpolation, significantly outperforming the state of the art.
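The citation statement for this entry lists Concat+MLP and MLB among the fusion strategies for two feature vectors. As a rough sketch of how the two differ, the block below fuses an image feature and a sequence feature both ways: by concatenation followed by a small perceptron, and by a low-rank bilinear form in which each input is projected to a shared rank and the projections are combined with a Hadamard product. All dimensions, the random weights, the tanh and ReLU nonlinearities, and the omission of bias terms are assumptions for illustration; this is not the implementation of either cited paper.

```python
import math
import random


def matvec(matrix, vector):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def random_matrix(rows, cols, rng):
    """Hypothetical weight initialiser: uniform values in [-1, 1]."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def concat_mlp_fusion(x, y, w_hidden, w_out):
    """Concatenate both inputs, then apply one ReLU hidden layer and a linear output."""
    hidden = [max(0.0, h) for h in matvec(w_hidden, x + y)]
    return matvec(w_out, hidden)


def low_rank_bilinear_fusion(x, y, u_t, v_t, p_t):
    """Project each input to a shared rank, combine with a Hadamard product, project out."""
    x_proj = [math.tanh(a) for a in matvec(u_t, x)]
    y_proj = [math.tanh(b) for b in matvec(v_t, y)]
    joint = [a * b for a, b in zip(x_proj, y_proj)]  # element-wise (Hadamard) product
    return matvec(p_t, joint)


rng = random.Random(0)
image_dim, seq_dim, rank, out_dim = 4, 3, 5, 2

x = [rng.uniform(-1.0, 1.0) for _ in range(image_dim)]  # stand-in image feature
y = [rng.uniform(-1.0, 1.0) for _ in range(seq_dim)]    # stand-in sequence feature

w_hidden = random_matrix(rank, image_dim + seq_dim, rng)
w_out = random_matrix(out_dim, rank, rng)
u_t = random_matrix(rank, image_dim, rng)
v_t = random_matrix(rank, seq_dim, rng)
p_t = random_matrix(out_dim, rank, rng)

fused_concat = concat_mlp_fusion(x, y, w_hidden, w_out)
fused_bilinear = low_rank_bilinear_fusion(x, y, u_t, v_t, p_t)
print(len(fused_concat), len(fused_bilinear))
```

In this sketch the bilinear variant is multiplicative: if either projected input is zero, the fused vector is zero, whereas the concatenation variant combines the inputs additively inside the hidden layer.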

