2023
DOI: 10.48550/arxiv.2303.16058
Preprint

Unmasked Teacher: Towards Training-Efficient Video Foundation Models

Abstract: Video Foundation Models (VFMs) have received limited exploration due to high computational costs and data scarcity. Previous VFMs rely on Image Foundation Models (IFMs), which face challenges in transferring to the video domain. Although VideoMAE has trained a robust ViT from limited data, its low-level reconstruction poses convergence difficulties and conflicts with high-level cross-modal alignment. This paper proposes a training-efficient method for temporal-sensitive VFMs that integrates the benefits of exi…

Cited by 2 publications
(2 citation statements)
References 48 publications
“…LLMs are able to provide semantic information and generate symbolic spatial signals, which can serve as guidance for video scene understanding. Recently this has been demonstrated for interactive video-based dialogue and conversation [42], [41], [83], [90], [46], [91]. In this context, Video-ChatGPT [42] is designed for video understanding and conversation by capturing the spatial-temporal relationships between video frames based on LLMs.…”
Section: B Video Scene Understanding
Citation type: mentioning
confidence: 99%
“…For this purpose, several video foundation models (InternVideo [6], mPLUG [7], UnMasked Teacher [8]) were analyzed within such machine learning tasks, which can be divided into two groups:…”
Section: Main Part Artificial Intelligence Based Proctoring Systems A...
Citation type: mentioning
confidence: 99%