2021
DOI: 10.48550/arxiv.2112.04446
Preprint

Everything at Once -- Multi-modal Fusion Transformer for Video Retrieval

Abstract: Multi-modal learning from video data has seen increased attention recently as it allows training semantically meaningful embeddings without human annotation, enabling tasks like zero-shot retrieval and classification. In this work, we present a multi-modal, modality-agnostic fusion transformer approach that learns to exchange information between multiple modalities, such as video, audio, and text, and to integrate them into a joined multi-modal representation to obtain an embedding that aggregates multi-modal temp…
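The abstract describes a single, modality-agnostic transformer that fuses token sequences from any subset of modalities into one joint embedding. Below is a minimal PyTorch sketch of that idea, assuming pre-extracted feature sequences per modality; the class name, dimensions, and mean-pooling are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class FusionTransformer(nn.Module):
    """Hypothetical sketch: project each modality into a shared token space,
    run one shared encoder over the concatenated tokens, pool to one embedding."""

    def __init__(self, dims, d_model=512):
        super().__init__()
        # Per-modality linear projections into the shared token dimension.
        self.proj = nn.ModuleDict({m: nn.Linear(d, d_model) for m, d in dims.items()})
        # Learned modality-type embeddings so the shared encoder can tell tokens apart.
        self.type_emb = nn.ParameterDict(
            {m: nn.Parameter(torch.zeros(1, 1, d_model)) for m in dims}
        )
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)

    def forward(self, inputs):
        # inputs: dict mapping a subset of modality names to (batch, seq, feat) tensors.
        tokens = [self.proj[m](x) + self.type_emb[m] for m, x in inputs.items()]
        fused = self.encoder(torch.cat(tokens, dim=1))  # joint attention across modalities
        return fused.mean(dim=1)                        # one joint embedding per sample
```

Because every subset of modalities goes through the same encoder, a call like `model({"video": v, "text": t})` and a call like `model({"audio": a})` produce embeddings in the same space, which is what makes the fusion modality agnostic in this sketch.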

Cited by 1 publication (1 citation statement)
References 26 publications (68 reference statements)
“…Less common are approaches which learn self-supervised models with multiple modalities at once. One recent work in this direction is [46], which learns representations using audio, video and text. However, to avoid the collapse of the self-supervised loss, they feed the modalities two at a time, increasing the amount of necessary forward passes.…”
Section: Related Work (mentioning, confidence: 99%)
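The citing work notes that the modalities are fed two at a time to keep the self-supervised loss from collapsing, at the cost of extra forward passes. The sketch below illustrates that kind of pairwise scheme, reusing the hypothetical FusionTransformer above with an InfoNCE-style contrastive loss; the pairing and loss details are assumptions for illustration, not taken from [46].

```python
import itertools
import torch
import torch.nn.functional as F

def info_nce(a, b, temperature=0.07):
    # Symmetric InfoNCE between two batches of L2-normalised embeddings.
    logits = F.normalize(a, dim=-1) @ F.normalize(b, dim=-1).T / temperature
    targets = torch.arange(a.size(0), device=a.device)
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets))

def pairwise_fusion_loss(model, batch):
    # batch: dict of modality name -> (batch, seq, feat) tensors, e.g. video/audio/text.
    loss = 0.0
    for m1, m2 in itertools.combinations(batch.keys(), 2):
        emb1 = model({m1: batch[m1]})  # one forward pass per modality in each pair ...
        emb2 = model({m2: batch[m2]})  # ... which is the overhead the citation points out
        loss = loss + info_nce(emb1, emb2)
    return loss
```

With three modalities this requires several forward passes per training step instead of a single pass over all modalities at once, which is the trade-off the citing authors highlight.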