Proceedings of the 30th ACM International Conference on Multimedia 2022
DOI: 10.1145/3503161.3548010

Boosting Video-Text Retrieval with Explicit High-Level Semantics

Abstract: Recently, masked video modeling has been widely explored and has significantly improved models' understanding of visual regions at a local level. However, existing methods usually adopt random masking and follow the same reconstruction paradigm to complete the masked regions, which does not leverage the correlations between cross-modal content. In this paper, we present MAsk for Semantics COmpleTion (MASCOT) based on semantic-based masked modeling. Specifically, after applying attention-based video maski…
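The truncated abstract contrasts attention-based video masking with the random masking used by prior work. As a rough illustration only (the paper's actual masking strategy and completion objective are not shown here), the sketch below masks the most-attended patches of a clip; the tensor shapes, the mask ratio, and the use of per-patch attention scores are assumptions, not details from the paper.

```python
import torch

def attention_guided_mask(patch_tokens, attn_scores, mask_ratio=0.5):
    """Mask the most-attended patches instead of masking uniformly at random.

    patch_tokens: (B, N, D) video patch embeddings (hypothetical layout)
    attn_scores:  (B, N) per-patch attention, e.g. CLS-to-patch attention
    Returns a boolean mask of shape (B, N), where True marks a masked patch.
    """
    B, N, _ = patch_tokens.shape
    num_mask = int(N * mask_ratio)
    # Indices of the top-k most attended patches per sample.
    top_idx = attn_scores.topk(num_mask, dim=1).indices
    mask = torch.zeros(B, N, dtype=torch.bool, device=patch_tokens.device)
    mask.scatter_(1, top_idx, True)
    return mask
```

A random-masking baseline would instead draw num_mask indices uniformly per sample; the sketch differs only in how the masked indices are chosen.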

Cited by 10 publications (2 citation statements) · References 63 publications

Citation statements (ordered by relevance):
“…To show the empirical efficiency of our SUMA, we train models on MSR-VTT (Xu et al, 2016), MSVD (Chen and Dolan, 2011), and ActivityNet (Fabian Caba Heilbron and Niebles, 2015). For a fair comparison, we only compare our methods with methods that are based on CLIP (Radford et al, 2021), i.e., Clip4Clip (Luo et al, 2022), CLIP2TV (Gao et al, 2021), X-CLIP, DiscreteCodebook (Liu et al, 2022a), TS2-Net (Liu et al, 2022b), CLIP2Video (Park et al, 2022), VCM, HiSE (Wang et al, 2022a), Align&Tell (Wang et al, 2022b), Center-CLIP (Zhao et al, 2022), and X-Pool (Gorti et al, 2022). Implementation details and evaluation protocols are deferred to the Appendix.…”
Section: Datasets and Baselines (mentioning)
confidence: 99%
“…To show the empirical efficiency of our S3MA, we train it on MSR-VTT (Xu et al, 2016), MSVD (Chen and Dolan, 2011), and ActivityNet (Fabian Caba Heilbron and Niebles, 2015). We compare with VLM (Xu et al, 2021a), HERO (Li et al, 2020a), VideoCLIP (Xu et al, 2021b), EvO (Shvetsova et al, 2022), OA-Trans (Wang et al, 2022a), RaP (Wu et al, 2022), LiteVL, NCL (Park et al, 2022b), TABLE (Chen et al, 2023), VOP (Huang et al, 2023), Clip4Clip (Luo et al, 2022), X-CLIP (Ma et al, 2022a), DiscreteCodebook (Liu et al, 2022a), TS2-Net (Liu et al, 2022b), VCM, HiSE (Wang et al, 2022b), Align&Tell (Wang et al, 2022c), Center-CLIP (Zhao et al, 2022), and X-Pool (Gorti et al, 2022). Implementation details and evaluation protocols are deferred to the Appendix.…”
Section: Datasets and Baselines (mentioning)
confidence: 99%
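Both excerpts defer their evaluation protocols to an appendix; text-video retrieval on MSR-VTT, MSVD, and ActivityNet is conventionally reported as Recall@K computed from a text-video similarity matrix. The sketch below shows that standard metric under the assumption that queries and targets are index-aligned (one ground-truth video per caption); it is not code from the cited papers.

```python
import numpy as np

def recall_at_k(sim_matrix, ks=(1, 5, 10)):
    """Compute text-to-video Recall@K from a similarity matrix.

    sim_matrix: (num_texts, num_videos), where sim_matrix[i, i] is the
    score of the ground-truth pair (assumed index-aligned).
    """
    ranks = []
    for i, row in enumerate(sim_matrix):
        order = np.argsort(-row)                     # sort videos by descending similarity
        ranks.append(int(np.where(order == i)[0][0]))  # rank of the ground-truth video
    ranks = np.asarray(ranks)
    return {f"R@{k}": float((ranks < k).mean() * 100) for k in ks}
```

Video-to-text retrieval is evaluated the same way on the transposed similarity matrix; median and mean rank are likewise derived from the same rank array.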