Proceedings of the 29th ACM International Conference on Multimedia 2021
DOI: 10.1145/3474085.3475431

Understanding Chinese Video and Language via Contrastive Multimodal Pre-Training

Cited by 30 publications (14 citation statements) · References 18 publications
“…The visual entailment result of ALBEF on the SNLI-VE dataset is shown in [figure omitted: TR@5, TR@10, IR@1, IR@5, IR@10 results; panel (b): the impact of α₂]. It is shown that as α₁ > 0 and α₂ > 0, the attack performance becomes stronger. This demonstrates the importance of the second term in Equation (6) and Equation (7).…”
Section: Ablation Study
confidence: 81%
“…It is shown that as α₁ > 0 and α₂ > 0, the attack performance becomes stronger. This demonstrates the importance of the second term in Equation (6) and Equation (7). The results are comparable when α₁ ≥ 1 and α₂ ≥ 1, indicating that Co-Attack is not sensitive to these hyper-parameters and does not require elaborate tuning.…”
confidence: 80%
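The ablation above weights two extra terms of a combined attack objective. Below is a minimal, hypothetical sketch of such a weighted multi-term loss and an ablation grid over the two weights; it is not Co-Attack's actual formulation (the real Equations (6) and (7) operate on the victim model's multimodal embeddings), and `attack_loss`, its stand-in terms, and the grid values are illustrative assumptions only.

```python
# Sketch of an adversarial objective with two weighted "second terms",
# assuming the form  L(δ) = L_primary(δ) + α1·L_img(δ) + α2·L_txt(δ).
# Setting α1 = α2 = 0 disables the second terms, as in the ablation.
import torch

def attack_loss(delta: torch.Tensor, alpha1: float, alpha2: float) -> torch.Tensor:
    # Stand-in differentiable terms; a real attack would compute these
    # from the victim model's image and text embeddings.
    l_primary = delta.pow(2).sum()
    l_img = (delta - 1.0).abs().sum()
    l_txt = torch.cos(delta).sum()
    return l_primary + alpha1 * l_img + alpha2 * l_txt

# Ablation grid over the two weights, mirroring the quoted finding that
# results are stable once α1 >= 1 and α2 >= 1.
delta = torch.randn(16, requires_grad=True)  # the adversarial perturbation
for a1 in (0.0, 1.0, 2.0):
    for a2 in (0.0, 1.0, 2.0):
        loss = attack_loss(delta, a1, a2)
        loss.backward()            # gradient would drive the perturbation update
        print(f"α1={a1}, α2={a2}, loss={loss.item():.3f}")
        delta.grad = None          # reset between ablation runs
```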
“…The studies of ClipBERT (Lei et al., 2021b) and Frozen (Bain et al., 2021) demonstrate that image-text pre-training is effective in improving downstream video-text performance. Recent efforts in image-text modeling have also shown that, when scaled up to hundreds of millions (Radford et al., 2021) or even billions (Li et al., 2021a) of image-text pairs, image-text models can achieve state-of-the-art results on various video-text tasks, including text-to-video retrieval (Luo et al., 2021a; Yuan et al., 2021; Yu et al., 2022a), video question answering (Alayrac et al., 2022), and video captioning (Tang et al., 2021a; Wang et al., 2022d).…”
Section: Transferring Image-Text Models to Video-Text Tasks
confidence: 99%
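As an illustration of how an image-text model is commonly transferred to text-to-video retrieval (the frame-pooling recipe popularized by CLIP4Clip-style approaches, not code from any of the cited papers), the sketch below encodes sampled frames independently with the image tower, mean-pools the frame embeddings into a video embedding, and ranks videos by cosine similarity to the text query. The toy encoders are placeholders for pretrained towers.

```python
# Minimal sketch: reuse an image-text dual encoder for text-to-video
# retrieval by encoding frames and mean-pooling into a video embedding.
import torch
import torch.nn.functional as F
from torch import nn

class ToyImageEncoder(nn.Module):
    """Stand-in for a pretrained image tower (e.g. a CLIP-style ViT)."""
    def __init__(self, dim: int = 512):
        super().__init__()
        self.proj = nn.Linear(3 * 64 * 64, dim)

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        # images: (N, 3, 64, 64) -> (N, dim)
        return self.proj(images.flatten(1))

class ToyTextEncoder(nn.Module):
    """Stand-in for a pretrained text tower."""
    def __init__(self, vocab: int = 10000, dim: int = 512):
        super().__init__()
        self.emb = nn.EmbeddingBag(vocab, dim)  # mean of token embeddings

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        # token_ids: (B, L) -> (B, dim)
        return self.emb(token_ids)

def video_embedding(frames: torch.Tensor, image_enc: nn.Module) -> torch.Tensor:
    """frames: (B, T, 3, H, W). Encode each frame independently, mean-pool over T."""
    b, t = frames.shape[:2]
    frame_emb = image_enc(frames.flatten(0, 1))        # (B*T, dim)
    frame_emb = frame_emb.view(b, t, -1)               # (B, T, dim)
    return F.normalize(frame_emb.mean(dim=1), dim=-1)  # (B, dim)

# Text-to-video retrieval: rank videos by cosine similarity to the query.
image_enc, text_enc = ToyImageEncoder(), ToyTextEncoder()
videos = torch.randn(8, 12, 3, 64, 64)             # 8 videos, 12 sampled frames each
query = torch.randint(0, 10000, (1, 16))           # one tokenized caption
v = video_embedding(videos, image_enc)             # (8, dim)
q = F.normalize(text_enc(query), dim=-1)           # (1, dim)
ranking = (q @ v.T).argsort(dim=-1, descending=True)  # best-matching videos first
print(ranking)
```

Metrics such as TR@k and IR@k in the ablation quote above are recall-at-k scores computed from exactly this kind of similarity ranking.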
“…Due to the lack of large-scale non-English video-text datasets for both pre-training and downstream evaluation, video-text models in non-English languages are less explored. As initial attempts, Lei et al. (2021a)…”
Section: Multi-Lingual VLP for Video-Text Tasks
confidence: 99%