2020
DOI: 10.48550/arxiv.2006.11747
Preprint

Weak Supervision and Referring Attention for Temporal-Textual Association Learning

Zhiyuan Fang,
Shu Kong,
Zhe Wang
et al.

Figure 1: One practical application of weakly supervised temporal-textual association learning is video moment retrieval (e.g., in surveillance video) using a natural-language query, which requires the video-level caption to align temporally with the video segment without any annotation. Example queries from the figure: "A man from UPS came and delivered the package." and "Later, a lady went in the car with the package."

Cited by 3 publications (4 citation statements)
References 47 publications

“…We compare AutoTVG with other weakly-supervised counterparts by applying the video moment generation method to Charades-STA and taking the moment that has the maximum similarity with its caption as the final prediction. We observe that AutoTVG achieves highly comparable results with CTF [7] and WSRA [11], and even surpasses SCN [29] by 4.2% at R@0.5 and CTF [7] by 1.1% at mIoU, which shows that our video moment generation module provides reliable candidate moments. Thanks to the strong alignment between vision and language features from CLIP, we obtain decent weakly-supervised results under a simple non-parametric matching strategy.…”
Section: Results
Mentioning confidence: 64%
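The non-parametric matching strategy described above can be made concrete with a short sketch. The snippet below is illustrative, not AutoTVG's actual code: it assumes OpenAI's `clip` package, pre-extracted PIL frames, and a hypothetical list of `(start, end)` frame-index proposals, and it simply returns the proposal whose mean-pooled CLIP frame embedding is most similar to the caption embedding.

```python
# Minimal sketch: pick the candidate moment whose mean CLIP frame
# embedding is most similar to the caption embedding (non-parametric).
import torch
import clip  # OpenAI CLIP: pip install git+https://github.com/openai/CLIP.git

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

def best_moment(frames, proposals, caption):
    """frames: list of PIL images; proposals: hypothetical list of
    (start, end) frame-index pairs; caption: natural-language query."""
    with torch.no_grad():
        imgs = torch.stack([preprocess(f) for f in frames]).to(device)
        frame_feats = model.encode_image(imgs)
        frame_feats = frame_feats / frame_feats.norm(dim=-1, keepdim=True)
        text_feat = model.encode_text(clip.tokenize([caption]).to(device))
        text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)
        # Mean-pool frame features inside each proposal, then score by
        # cosine similarity with the caption; the highest score wins.
        scores = []
        for s, e in proposals:
            moment = frame_feats[s:e + 1].mean(dim=0, keepdim=True)
            moment = moment / moment.norm(dim=-1, keepdim=True)
            scores.append((moment @ text_feat.T).item())
        return proposals[int(torch.tensor(scores).argmax())]
```

Because nothing here is trained, the quality of the result rests entirely on CLIP's vision-language alignment, which is exactly the point the statement makes.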
“…For Weakly-Supervised Video Localization [7,25], due to the lack of accurate timestamp annotations, researchers try to train models to learn the cross-modal correlation without timestamp supervision. To overcome this difficulty, many weakly supervised models use contrastive learning to exploit the similarity information between sentences and make full use of the pairing information between video and text.…”
Section: Weakly Supervised Temporal Localization
Mentioning confidence: 99%
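As a rough illustration of the contrastive objective such weakly supervised methods rely on (the exact losses vary by paper), here is a minimal symmetric video-text InfoNCE loss in PyTorch; the function name, embedding shapes, and temperature value are assumptions, not taken from any of the cited works.

```python
# Minimal sketch of a symmetric video-text InfoNCE loss: matched
# video/caption pairs in a batch are positives, all others negatives.
import torch
import torch.nn.functional as F

def video_text_contrastive_loss(video_emb, text_emb, temperature=0.07):
    """video_emb, text_emb: (B, D) embeddings of paired videos/captions."""
    v = F.normalize(video_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = v @ t.T / temperature           # (B, B) similarity matrix
    targets = torch.arange(v.size(0), device=v.device)
    # Symmetric cross-entropy: match video->text and text->video.
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.T, targets))
```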
“…To demonstrate that learning more fine-grained phrase-level predictions is beneficial for improving the model's ability to generalize to new combinations of seen phrases (combinational generalization), we put forward a new dataset split for Charades-STA. Inspired by data-splitting methods proposed in some weakly supervised settings [7,25], we aim to test the model's performance in the scenario where the data distributions of the training and testing sets differ. We split the Charades-STA dataset as below to maximize the variance of phrases in the training split.…”
Section: Experiment Settings
Mentioning confidence: 99%
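To make the idea of such a split concrete, the sketch below shows one hypothetical way to build it: group annotations by their phrase combination and hold out whole combinations, so the test set contains only unseen combinations of (possibly individually seen) phrases. This is an illustration under assumed data fields, not the authors' actual splitting procedure.

```python
# Illustrative sketch of a compositional split: hold out captions whose
# phrase combination never appears in training, so the test set contains
# only new combinations of individually seen phrases.
import random
from collections import defaultdict

def compositional_split(annotations, test_ratio=0.3, seed=0):
    """annotations: list of dicts with a 'phrases' tuple per caption
    (hypothetical field; a real split would parse phrases from captions)."""
    random.seed(seed)
    by_combo = defaultdict(list)
    for ann in annotations:
        by_combo[tuple(sorted(ann["phrases"]))].append(ann)
    combos = list(by_combo)
    random.shuffle(combos)
    n_test = int(len(combos) * test_ratio)
    test = [a for c in combos[:n_test] for a in by_combo[c]]
    train = [a for c in combos[n_test:] for a in by_combo[c]]
    return train, test
```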
“…Kervadec et al. [25] use weak supervision in the form of object-word alignment as a pre-training task, Trott et al. [53] use the counts of objects in an image as weak supervision to guide VQA for counting-based questions, Gokhale et al. [16] use rules about logical connectives to augment training datasets for yes-no questions, and Zhao et al. [65] use word embeddings [36] to design an additional weak-supervision objective. Weak supervision from captions has also recently been used for visual grounding tasks [19,37,12,4].…”
Section: Related Work
Mentioning confidence: 99%