2021 IEEE International Conference on Image Processing (ICIP) 2021
DOI: 10.1109/icip42928.2021.9506218
Weakly-Supervised Moment Retrieval Network for Video Corpus Moment Retrieval

Abstract: This paper proposes the Weakly-supervised Moment Retrieval Network (WMRN) for Video Corpus Moment Retrieval (VCMR), which retrieves temporal moments pertinent to a natural language query from a large video corpus. Previous methods for VCMR require full supervision with temporal boundary annotations for training, a labor-intensive process of annotating boundaries in a large number of videos. To address this, the proposed WMRN performs VCMR in a weakly-supervised manner, where WMRN is learned w…
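To make the weakly-supervised setting concrete, the following is a minimal illustrative sketch: with only video-level query pairings as labels, a model can score individual clips against the query and aggregate clip scores into a video score for training, then read off the best-scoring clip at retrieval time without ever seeing boundary annotations. The cosine-similarity scoring and max-pooling aggregation here are assumptions for illustration, not WMRN's actual architecture.

```python
import numpy as np

def moment_scores(query_emb: np.ndarray, clip_embs: np.ndarray) -> np.ndarray:
    """Cosine similarity between one query embedding and each clip embedding."""
    q = query_emb / np.linalg.norm(query_emb)
    c = clip_embs / np.linalg.norm(clip_embs, axis=1, keepdims=True)
    return c @ q

def video_level_score(query_emb: np.ndarray, clip_embs: np.ndarray) -> float:
    # Weak supervision: only the video-query pairing is labeled, so the
    # video score aggregates (here, max-pools) the unlabeled clip scores.
    return float(moment_scores(query_emb, clip_embs).max())

rng = np.random.default_rng(0)
query = rng.normal(size=8)
clips = rng.normal(size=(20, 8))       # 20 clips, 8-d embeddings (toy sizes)
scores = moment_scores(query, clips)
best_clip = int(scores.argmax())       # retrieved moment; no boundary labels used
```

Training would push `video_level_score` up for matched video-query pairs and down for mismatched ones (e.g. with a contrastive loss), letting the clip-level scores emerge as a by-product.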

Cited by 4 publications (3 citation statements)
References 14 publications
“…The green curve denotes L_ar with L_rub, and it shows further optimization compared to training without L_rub, as L_ar^{g(e)} decreases. This indicates that neural networks can be further optimized over the training epochs by calibrating their training objectives, which is also validated, in other ways, in other multi-modal systems (Yoon et al., 2023; Zheng et al., 2022).…”
Section: Ablation Study
confidence: 76%
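The calibration described in the quoted statement amounts to combining a primary loss with an auxiliary term so the objective keeps providing gradient signal as epochs progress. A minimal sketch, assuming a simple weighted sum (the names L_ar and L_rub come from the quote; the weighting scheme is a hypothetical illustration, not the cited paper's method):

```python
def calibrated_objective(l_ar: float, l_rub: float, lambda_rub: float = 0.5) -> float:
    """Combine the primary loss L_ar with the auxiliary loss L_rub."""
    # lambda_rub is an assumed balancing weight for this sketch.
    return l_ar + lambda_rub * l_rub

# Toy per-epoch loss values: if both terms shrink, the combined
# objective keeps decreasing, mirroring the green curve in the quote.
losses = [calibrated_objective(1.0 - 0.1 * e, 0.5 - 0.05 * e) for e in range(5)]
assert all(a > b for a, b in zip(losses, losses[1:]))  # strictly decreasing
```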
“…Thus, our future work is to build a dataset for a more general form of RHL tasks by collecting real environmental data under more diverse conditions, such as the co-occurrence of humans and outdoor environments. Furthermore, we also consider extending the current training framework of CLNet to weakly-supervised settings [26], [27], which mitigates the reliance on temporal annotations for training localization in MD signatures.…”
Section: Limitation
confidence: 99%
“…For the video encoder in the grounding model, we follow previous methods [10], [61] and use I3D [67] for C-STA and C3D [68] for ANC. The features are extracted by downsampling each video at a rate of 8, and the maximum number of video segments is set to 200.…”
Section: B. Implementation Details
confidence: 99%
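The preprocessing in the quoted implementation details (downsample at a rate of 8, cap sequences at 200 segments) can be sketched as below. The feature extractor itself (I3D / C3D) is out of scope; this sketch assumes a precomputed per-frame feature array, and the function name is hypothetical.

```python
import numpy as np

def prepare_video_features(frame_feats: np.ndarray,
                           rate: int = 8,
                           max_segments: int = 200) -> np.ndarray:
    """Downsample frame features by `rate`, then truncate to `max_segments`."""
    sampled = frame_feats[::rate]      # keep every `rate`-th frame feature
    return sampled[:max_segments]      # cap the sequence length

feats = np.zeros((4000, 1024))         # e.g. 4000 frames of 1024-d features
out = prepare_video_features(feats)
assert out.shape == (200, 1024)        # 4000 / 8 = 500 segments, capped at 200
```

Short videos are left as-is: a video yielding fewer than 200 downsampled segments is simply not truncated.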