Everything at Once -- Multi-modal Fusion Transformer for Video Retrieval

Shvetsova, Nina; Chen, Brian; Rouditchenko, Andrew; Thomas, Samuel; Kingsbury, Brian; Feris, Rogério; Harwath, David; Glass, James; Kuehne, Hilde

doi:10.48550/arxiv.2112.04446

Cited by 1 publication

(1 citation statement)

References 26 publications

(68 reference statements)


“…Less common are approaches which learn self-supervised models with multiple modalities at once. One recent work in this direction is [46], which learns representations using audio, video and text. However, to avoid the collapse of the self-supervised loss, they feed the modalities two at a time, increasing the amount of necessary forward passes.…”

Section: Related Work (mentioning)

confidence: 99%

Zorro: the masked multimodal transformer

Recasens, Lin, Carreira, et al. 2023

Preprint

Attention-based models are appealing for multimodal processing because inputs from multiple modalities can be concatenated and fed to a single backbone network, thus requiring very little fusion engineering. The resulting representations are, however, fully entangled throughout the network, which may not always be desirable: in learning, contrastive audio-visual self-supervised learning requires independent audio and visual features to operate, otherwise learning collapses; in inference, evaluation of audio-visual models should be possible on benchmarks having just audio or just video. In this paper, we introduce Zorro, a technique that uses masks to control how inputs from each modality are routed inside Transformers, keeping some parts of the representation modality-pure. We apply this technique to three popular transformer-based architectures (ViT, Swin and HiP) and show that with contrastive pre-training Zorro achieves state-of-the-art results on most relevant benchmarks for multimodal tasks (AudioSet and VGGSound). Furthermore, the resulting models are able to perform unimodal inference on both video and audio benchmarks such as Kinetics-400 or ESC-50.
